Abstract

In this paper, we propose a novel model for high-dimensional data, called the Hybrid Orthogonal
Projection and Estimation (HOPE) model, which combines a linear orthogonal projection
and a finite mixture model under a unified generative modeling framework. The HOPE model
itself can be learned unsupervised from unlabelled data based on maximum likelihood
estimation, as well as discriminatively from labelled data. More interestingly, we show that the
proposed HOPE models are closely related to neural networks (NNs) in the sense that each hidden
layer can be reformulated as a HOPE model. As a result, the HOPE framework can be used as a
novel tool to probe why and how NNs work and, more importantly, to learn NNs in either supervised
or unsupervised ways. In this work, we investigate the HOPE framework to learn NNs
for several standard tasks, including image recognition on MNIST and speech recognition on
TIMIT. Experimental results show that the HOPE framework yields significant performance
gains over the current state-of-the-art methods in various types of NN learning problems,
including unsupervised feature learning and supervised or semi-supervised learning.

1 Introduction 

Machine learning systems normally consist of two distinct steps in design, namely feature
extraction and data modeling. In feature extraction, some engineering tricks are used to
pre-process raw data to extract useful and representative features for the underlying data sets.
As a result, this stage is sometimes called feature engineering. For high-dimensional data,
the feature extraction stage needs to distill “good” features that are representative enough to
discriminate different data samples, and it also has to perform effective dimensionality reduction
to generate less correlated features that can be easily modeled in a lower-dimensional space. In
data modeling, an appropriate model is selected to model data in the lower-dimensional feature
space. There is a wide range of models available for this purpose, such as k-Nearest-Neighbours
(kNN) methods, decision trees, linear discriminant models, neural networks, statistical models
from the exponential family, mixtures of exponential family distributions, and so on.
Subsequently, all unknown model parameters are estimated from available training samples based
on a certain learning criterion, such as maximum likelihood estimation (MLE) or discriminative
learning.

In many traditional machine learning methods, feature extraction and data modeling are
normally conducted independently in two loosely-coupled stages, where feature extraction
parameters and model parameters are separately optimized based on rather different criteria.
In particular, feature extraction is normally regarded as a pre-processing stage, where feature
extraction parameters are estimated under some quite loose conditions, such as the assumption
that data is normally distributed in a high-dimensional space, as implied in linear discriminant
analysis (LDA) and principal component analysis (PCA). On the other hand, neural networks
(NNs) favor an end-to-end learning process, which is normally considered an exception to
the above paradigm. In practice, it has been widely observed that NNs are capable of dealing
with almost any type of raw data directly without any explicit feature engineering. In the recent
resurgence of NNs in “deep learning”, more and more empirical results have demonstrated
that deep neural networks (DNNs) can effectively de-correlate high-dimensional raw data and
automatically learn useful features from large training sets, without being disturbed by “the
curse of dimensionality”. However, it remains an open question why NNs can handle such data
and what mechanism NNs use to de-correlate high-dimensional raw data to learn good
feature representations for many complicated real-world tasks.

In this paper, we first propose a novel data modeling framework for high-dimensional data,
namely Hybrid Orthogonal Projection and Estimation (HOPE). The key argument for the HOPE
framework is that feature extraction and data modeling should not be decoupled into two separate
stages in learning, and a good feature extraction module cannot be learned based on some
over-simplified and unrealistic modeling assumptions. Feature extraction and data modeling
must be jointly learned and optimized by considering the complex nature of data distributions.
This is particularly important in coping with high-dimensional data arising from most real-world
applications, such as image, video and speech signals. In the HOPE framework, we propose to
model high-dimensional data by combining a relatively simple feature extraction model, namely
a linear orthogonal projection, with a powerful statistical model for data modeling, namely a
finite mixture model of exponential family distributions, under a unified generative modeling
framework. In this paper, we consider two possible choices for the mixture models, namely
Gaussian mixture models (GMMs) and mixtures of von Mises-Fisher distributions (movMFs).
First of all, an orthogonal linear projection is used in feature extraction to ensure that the
highly-correlated high-dimensional raw data is first projected onto a lower-dimensional latent
feature space, where all feature dimensions are largely de-correlated. This makes it much easier
to model data in this feature space than in the original data space. Secondly, in the
HOPE framework, we propose to use a powerful model to represent data in the lower-dimensional
feature space, rather than using over-simplified models for computational convenience. This
is very important since real-world data tend to follow a rather complex distribution, which
can always be approximated by a finite mixture model up to any arbitrary precision. Thirdly,
the most important argument in HOPE is that both the orthogonal projection and the mixture
model must be learned jointly according to a single unified criterion. In this paper, we first study
how to learn HOPE in an unsupervised manner based on the conventional maximum likelihood
(ML) criterion, and also explain that the HOPE models can be learned in a supervised way
based on any discriminative learning criterion.

Another important finding in this work is that the proposed HOPE models are closely related
to the neural networks (NNs) currently widely used in deep learning. As we will show, any single
hidden layer in the most popular rectified linear (ReLU) NNs can always be reformulated as
a HOPE model consisting of a linear orthogonal projection and a mixture of von Mises-Fisher
distributions (movMFs). This formulation helps to explain how NNs actually deal with
high-dimensional data and why NNs can de-correlate almost any type of high-dimensional data to
generate good feature representations. More importantly, this formulation may open up new
possibilities to learn NNs more effectively. For example, both supervised and unsupervised
learning algorithms for the HOPE models can be easily applied to learning NNs. By imposing
an explicit orthogonal constraint on the feature extraction layer, we will show that the HOPE
methods are very effective in learning NNs for both supervised and unsupervised learning. In
unsupervised learning, we show that the maximum likelihood (ML) based HOPE learning
algorithms can serve as a very effective unsupervised learning method to learn NNs from
unlabelled data. Our experimental results show that the ML-based HOPE learning
algorithm can learn good feature representations in an unsupervised way without using any data
labels. These unsupervised learned features may be fed to some simple post-stage classifiers,
such as a linear SVM, to yield performance comparable to that of deep NNs learned end-to-end
in a supervised way with data labels. Our proposed unsupervised learning algorithms significantly
outperform the previous methods based on the Restricted Boltzmann Machine (RBM) and the
autoencoder variants [^[^. Moreover, in supervised learning, relying on the HOPE models, we have
managed to learn some shallow NNs from scratch, which perform comparably with the
state-of-the-art deep neural networks (DNNs), as opposed to learning shallow NNs to mimic a
pre-trained deep neural network as in [^. Finally, the HOPE models can also be used to train deep
neural networks, and they normally provide a significant performance gain over the standard NN
learning methods. These results suggest that the orthogonal constraint in HOPE may serve as a
good model regularization in the learning of NNs.


2 Related Work 


Dimensionality reduction in feature space is a well-studied topic in machine learning. Among
many techniques, PCA is the most popular one in this category. PCA is defined as the orthogonal
projection of the high-dimensional data onto a lower-dimensional linear space, known as the
principal subspace, such that the variance of the projected data is maximized [^. The nice
property of PCA is that it can be formulated as an eigenvector problem of the data covariance
matrix, where a simple closed-form solution exists. Moreover, in the probabilistic PCA [26, 29],
PCA can be expressed as the maximum likelihood solution to a probabilistic latent variable
model. In this case, if the projected data are assumed to follow a zero-mean unit-covariance
Gaussian distribution in the principal subspace, the probabilistic PCA can also be solved by an
exact closed-form solution related to the eigenvectors of the data covariance matrix. The major
limitation of PCA is that it is constrained to learn a linear subspace. Many approaches have
been proposed to perform nonlinear dimension reduction to learn possible nonlinear manifolds
embedded within a high-dimensional data space. One way to model the nonlinear structure
is through a combination of linear models, so that we make a piece-wise linear approximation
to the manifold. This can be obtained by a clustering technique to partition the data set into
local groups with standard PCA applied to each group. In [28], the high-dimensional raw data
is assumed to follow a mixture model, where each component may perform
its own maximum likelihood PCA in a local region. However, in these methods, it may be quite
challenging to perform effective clustering or estimate good mixture models for high-dimensional
data due to the strong correlation among data dimensions. Alternatively, in [^, a flexible
nonlinear method is proposed to reduce feature dimension based on a deep auto-associative
neural network.

HOPE (Zhang and Jiang)

Similar to PCA, Fisher’s linear discriminant analysis (LDA) can also be viewed as a
linear dimensionality reduction technique. However, PCA is unsupervised in the sense that
it depends only on the data, while Fisher’s LDA is supervised since it uses both data and
class-label information. The high-dimensional data are linearly projected to a subspace where
the classes are best distinguished as measured by the Fisher criterion. In [19], the so-called
heteroscedastic discriminant analysis (HDA) is proposed to extend LDA to deal with
high-dimensional data with heteroscedastic covariance, where a linear projection can be learned
from data and class labels based on the maximum likelihood criterion.


3 Hybrid Orthogonal Projection and Estimation (HOPE) 

Consider a standard PCA setting, where each data sample is represented as a high-dimensional
vector $\mathbf{x}$ with dimensionality D. Our goal is to learn a linear projection, represented as a matrix
$\mathbf{U}$, to map each data sample onto a space having dimensionality M < D, which is called the
latent feature space hereafter in this paper. Our proposed HOPE model is essentially a generative
model in nature, but it may also be viewed as a generalization of the probabilistic PCA
in [29] to consider a complex data distribution that has to be modeled by a finite mixture model
in the latent feature space. This setting is very different from [^, where the original data $\mathbf{x}$ is
modeled by mixture models in the original higher D-dimensional raw data space.

3.1 HOPE: combining generalized PCA with generative model 

Assume we have a full-size $D \times D$ orthogonal matrix $\hat{\mathbf{U}}$, satisfying
$\hat{\mathbf{U}}^T \hat{\mathbf{U}} = \hat{\mathbf{U}} \hat{\mathbf{U}}^T = \mathbf{I}$. Each data
sample $\mathbf{x}$ in the original D-dimensional data space can be decomposed based on all orthogonal
row vectors of $\hat{\mathbf{U}}$, denoted as $\mathbf{u}_i$ with $i = 1, \cdots, D$, as follows:

$$ \mathbf{x} = \sum_{i=1}^{D} (\mathbf{x} \cdot \mathbf{u}_i)\, \mathbf{u}_i. \tag{1} $$

As shown in PCA, each high-dimensional data sample $\mathbf{x}$ can normally be represented fairly precisely
in a lower-dimensional principal subspace, and the contributions from the remaining dimensions
may be viewed as random residual noises that have sufficiently small variances. Therefore, we
have

$$ \mathbf{x} = \underbrace{(\mathbf{x} \cdot \mathbf{u}_1)\, \mathbf{u}_1 + \cdots + (\mathbf{x} \cdot \mathbf{u}_M)\, \mathbf{u}_M}_{\text{signal component } \tilde{\mathbf{x}}} + \underbrace{(\mathbf{x} \cdot \mathbf{u}_{M+1})\, \mathbf{u}_{M+1} + \cdots + (\mathbf{x} \cdot \mathbf{u}_D)\, \mathbf{u}_D}_{\text{noise component } \bar{\mathbf{x}}} \tag{2} $$

Here we are interested in learning an $M \times D$ projection matrix, denoted as $\mathbf{U}$, to extract
the signal component $\tilde{\mathbf{x}}$. First of all, if M (M < D) is selected properly, the projection may
serve as an effective feature extraction for signals as well as a mechanism to eliminate unwanted
noises from the higher-dimensional raw data. This may make the subsequent learning process
more robust and less prone to overfitting. Secondly, all M row vectors $\mathbf{u}_i$ with $i = 1, \cdots, M$ are
learned to represent signals well in a lower M-dimensional space. Furthermore, since all $\mathbf{u}_i$ are
orthogonal, the latent features are largely de-correlated. This will significantly simplify
the following learning problem as well.

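In this case, each D-dimensional data sample, $\mathbf{x}$, is linearly projected onto an M-dimensional
vector $\mathbf{z}$ as $\mathbf{z} = \mathbf{U}\mathbf{x}$, where $\mathbf{U}$ is an orthogonal matrix, satisfying $\mathbf{U}\mathbf{U}^T = \mathbf{I}$. Meanwhile, we
denote the projection of the unwanted noise component $\bar{\mathbf{x}}$ as $\mathbf{n}$, and $\mathbf{n}$ can be similarly computed
as $\mathbf{n} = \mathbf{V}\mathbf{x}$, where $\mathbf{V}$ is another orthogonal matrix corresponding to all noise dimensions,
satisfying $\mathbf{V}\mathbf{V}^T = \mathbf{I}$. Moreover, $\mathbf{V}$ is orthogonal to $\mathbf{U}$, i.e., $\mathbf{V}\mathbf{U}^T = \mathbf{0}$. Overall, we may represent
the above projection as follows:

$$ \begin{bmatrix} \mathbf{z} \\ \mathbf{n} \end{bmatrix} = \begin{bmatrix} \mathbf{U} \\ \mathbf{V} \end{bmatrix} \mathbf{x} = \hat{\mathbf{U}} \mathbf{x} \tag{3} $$

where $\hat{\mathbf{U}}$ is the above-mentioned $D \times D$ orthogonal projection matrix.

Moreover, it is straightforward to show that the signal component $\tilde{\mathbf{x}}$ and the residual noise
$\bar{\mathbf{x}}$ in the original data space can be easily computed from the above projections as follows:

$$ \tilde{\mathbf{x}} = \mathbf{U}^T \mathbf{z} = \mathbf{U}^T \mathbf{U} \mathbf{x} \tag{4} $$

$$ \bar{\mathbf{x}} = \mathbf{x} - \tilde{\mathbf{x}} = (\mathbf{I} - \mathbf{U}^T \mathbf{U})\, \mathbf{x} \tag{5} $$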
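The decomposition in eqs. (3)-(5) can be checked numerically. The following is only a minimal sketch, not part of the original derivation: the dimensions, the random seed and all variable names are illustrative assumptions, and the full orthogonal matrix is obtained from a QR factorization of a random matrix rather than learned from data.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 6, 2  # raw and latent dimensionality (illustrative values)

# Build a full D x D orthogonal matrix via QR; its rows play the role of u_1..u_D.
Q, _ = np.linalg.qr(rng.standard_normal((D, D)))
U_full = Q.T
U, V = U_full[:M], U_full[M:]    # signal rows (M x D) and noise rows ((D-M) x D)

x = rng.standard_normal(D)       # one raw data sample
z = U @ x                        # signal projection, eq. (3)
n = V @ x                        # noise projection, eq. (3)
x_signal = U.T @ z               # signal component in the data space, eq. (4)
x_noise = x - x_signal           # residual noise component, eq. (5)
```

Under these assumptions the two components add up to the original sample, and the squared norm of the sample splits into the squared norms of the two projections.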

In the following, we consider how to learn the projection matrix $\mathbf{U}$ to represent D-dimensional
data well in a lower M-dimensional feature space. If this projection is learned properly, we may
assume the above signal projection, $\mathbf{z}$, and the residual noise projection, $\mathbf{n}$, are independent in
the latent feature space. Therefore, we may derive the probability distribution of the original
data as follows:

$$ p(\mathbf{x}) = |\hat{\mathbf{U}}^{-1}| \cdot p(\mathbf{z}) \cdot p(\mathbf{n}) \tag{6} $$

where $|\hat{\mathbf{U}}^{-1}|$ denotes the determinant of the Jacobian matrix that linearly maps data from the
projected space back to the original data space. If $\hat{\mathbf{U}}$ is orthonormal, the above Jacobian term
equals one. In this work, we follow [29] to assume the residual noise projection $\mathbf{n}$ follows an
isotropic covariance Gaussian distribution in the (D-M)-dimensional space, i.e.,
$p(\mathbf{n}) \sim \mathcal{N}(\mathbf{n} \mid \mathbf{0}, \sigma^2 \mathbf{I})$, where $\sigma^2$ is
a variance parameter to be learned from data. As for the signal projection $\mathbf{z}$, we adopt a quite
different approach, as described below in detail.

In all previous works, the signal projection $\mathbf{z}$ is assumed to follow a simple distribution in the
M-dimensional space. For example, $\mathbf{z}$ is assumed to follow a zero-mean unit-covariance Gaussian
distribution in probabilistic PCA [26, 29]. The advantage of this assumption is that an exact
closed-form solution may be derived to calculate the projection matrix $\mathbf{U}$ using spectral
methods based on the data covariance matrix.

^Without loss of generality, we may simply normalize the training data to ensure that the residual noises have
zero mean.

However, in most real-world applications, $\mathbf{z}$ still lies in a very high-dimensional space
even after the linear projection, so it does not make sense to assume that $\mathbf{z}$ follows a simple unimodal
distribution. As widely reported in the literature, it is empirically observed that real-world data
normally do not follow a unimodal distribution in a high-dimensional space and usually
appear only in multiple concentrated regions of that space. More realistically,
it is better to assume that $\mathbf{z}$ follows a finite mixture model in the M-dimensional feature space,
because a finite mixture model may theoretically approximate any arbitrary statistical distribution
as long as a sufficiently large number of mixture components are used. For simplicity, we
may assume $\mathbf{z}$ follows a finite mixture of some exponential family distributions:

$$ p(\mathbf{z}) = \sum_{k=1}^{K} \pi_k \cdot f_k(\mathbf{z} \mid \theta_k) \tag{7} $$

where $\pi_k$ denotes the mixture weights with $\sum_{k=1}^{K} \pi_k = 1$, and $f_k(\mathbf{z} \mid \theta_k)$ stands for a unimodal
distribution from the exponential family with model parameters $\theta_k$. We use $\Theta$ to denote all
model parameters in the mixture model, i.e., $\Theta = \{\theta_k, \pi_k \mid k = 1, \cdots, K\}$. In practice, $f_k(\mathbf{z} \mid \theta_k)$
is chosen from the exponential family based on the nature of data. In this paper, we consider
two possible choices for high-dimensional continuous data, namely the multivariate Gaussian
distributions and the von Mises-Fisher (vMF) distributions. The learning algorithm can be
easily extended to other models in the exponential family.

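For example, if we choose the multivariate Gaussian distribution, then $\mathbf{z}$ follows a Gaussian
mixture model (GMM) as follows:

$$ p(\mathbf{z}) = \sum_{k=1}^{K} \pi_k \cdot f_k(\mathbf{z} \mid \theta_k) = \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(\mathbf{z} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \tag{8} $$

where $\mathcal{N}(\mathbf{z} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ denotes a multivariate Gaussian distribution with the mean vector $\boldsymbol{\mu}_k$ and
the covariance matrix $\boldsymbol{\Sigma}_k$. Since the projection matrix $\mathbf{U}$ is orthogonal, all dimensions in $\mathbf{z}$
are highly de-correlated. Therefore, it is reasonable to assume each Gaussian component has a
diagonal covariance matrix $\boldsymbol{\Sigma}_k$. This may significantly simplify the model estimation of GMMs.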

Alternatively, we may select a less popular model, i.e., the von Mises-Fisher (vMF)
distribution. The vMF distribution may be viewed as a generalized normal distribution defined on
a high-dimensional spherical surface. In this case, $\mathbf{z}$ follows a mixture of von Mises-Fisher
(movMF) distributions as follows:

$$ p(\mathbf{z}) = \sum_{k=1}^{K} \pi_k \cdot f_k(\mathbf{z} \mid \theta_k) = \sum_{k=1}^{K} \pi_k \cdot C_M(|\boldsymbol{\mu}_k|) \cdot e^{\mathbf{z} \cdot \boldsymbol{\mu}_k} \tag{9} $$

where $\mathbf{z}$ is located on the surface of an M-dimensional sphere, i.e., $|\mathbf{z}| = 1$, $\boldsymbol{\mu}_k$ denotes all model
parameters of the k-th vMF component and is an M-dimensional vector in this case, and
$C_M(\kappa)$ is the probability normalization term of the k-th vMF component, defined as:

$$ C_M(\kappa) = \frac{\kappa^{M/2-1}}{(2\pi)^{M/2}\, I_{M/2-1}(\kappa)} \tag{10} $$

where $I_v(\cdot)$ denotes the modified Bessel function of the first kind at order $v$.
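To make the normalization term in eq. (10) concrete, the sketch below evaluates $\ln C_M(\kappa)$ in the log domain. It is only an illustration under stated assumptions: the Bessel function is computed from its standard power series with a fixed number of terms, which is adequate for moderate $\kappa$ but is not the numerical method adopted in this paper, and the function names `log_bessel_i` and `log_vmf_normalizer` are hypothetical.

```python
import math

def log_bessel_i(v, x, terms=200):
    """ln I_v(x) from the series I_v(x) = sum_m (x/2)^(2m+v) / (m! Gamma(m+v+1)), x > 0."""
    logs = [(2 * m + v) * math.log(x / 2.0) - math.lgamma(m + 1) - math.lgamma(m + v + 1)
            for m in range(terms)]
    top = max(logs)  # factor out the largest term to avoid overflow
    return top + math.log(sum(math.exp(t - top) for t in logs))

def log_vmf_normalizer(M, kappa):
    """ln C_M(kappa) as in eq. (10), for M >= 2 and kappa > 0."""
    v = M / 2.0 - 1.0
    return v * math.log(kappa) - (M / 2.0) * math.log(2.0 * math.pi) - log_bessel_i(v, kappa)
```

For M = 3 the normalizer has the closed form $\kappa / (4\pi \sinh\kappa)$, which gives a convenient sanity check for the series.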

^The main reason why we are interested in the von Mises-Fisher (vMF) distributions is that the choice of the 
vMF model can strictly link our HOPE model to regular neural networks in deep learning. We will elucidate this 
later in this paper. 
4 Unsupervised Learning of HOPE Models

Obviously, the HOPE model is essentially a generative model that combines feature extraction
and data modelling together, and thus its model parameters, including both the projection
matrix and the mixture model, can be estimated based on the maximum likelihood (ML) criterion,
just like normal generative models as well as the probabilistic PCA in [29]. However, since $\mathbf{z}$
follows a mixture distribution, no closed-form solution is available to derive either the projection
matrix or the mixture model. In this case, some iterative optimization algorithms, such as
stochastic gradient descent (SGD) [^[^, may be used to jointly estimate both the projection
matrix $\mathbf{U}$ and the underlying mixture model to maximize a joint likelihood function.
In this section, we assume the projection matrices, not only $\mathbf{U}$ but also the whole $\hat{\mathbf{U}}$, are all
orthonormal. As a result, the Jacobian term in eq.(6) disappears since it equals one. Refer
to Appendix A for the case where $\hat{\mathbf{U}}$ is not orthonormal.

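Given a training set $\mathbf{X} = \{\mathbf{x}_n \mid n = 1, \cdots, N\}$, and assuming that all $\mathbf{x}_n$ are
normalized to be of unit length, $|\mathbf{x}_n| = 1$, the joint log-likelihood function related to all HOPE
parameters, including the projection matrix $\mathbf{U}$, the mixture model $\Theta = \{\theta_k \mid k = 1, \cdots, K\}$ and
the residual noise variance $\sigma$, can be expressed as follows:

$$
\begin{aligned}
\mathcal{L}(\mathbf{U}, \Theta, \sigma \mid \mathbf{X}) &= \sum_{n=1}^{N} \ln \Pr(\mathbf{x}_n) = \sum_{n=1}^{N} \Big[ \ln \Pr(\mathbf{z}_n) + \ln \Pr(\mathbf{n}_n) \Big] \\
&= \underbrace{\sum_{n=1}^{N} \ln \Big( \sum_{k=1}^{K} \pi_k \cdot f_k(\mathbf{U}\mathbf{x}_n \mid \theta_k) \Big)}_{\mathcal{L}_1(\mathbf{U}, \Theta)} + \underbrace{\sum_{n=1}^{N} \ln \mathcal{N}(\mathbf{n}_n \mid \mathbf{0}, \sigma^2 \mathbf{I})}_{\mathcal{L}_2(\mathbf{U}, \sigma)}
\end{aligned}
\tag{11}
$$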


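The HOPE parameters, including $\mathbf{U}$, $\Theta$ and $\sigma$, can all be estimated by maximizing the above
likelihood function:

$$ \{\mathbf{U}^*, \Theta^*, \sigma^*\} = \arg\max_{\mathbf{U}, \Theta, \sigma} \mathcal{L}(\mathbf{U}, \Theta, \sigma \mid \mathbf{X}) \tag{12} $$

subject to the orthogonal constraint:

$$ \mathbf{U}\mathbf{U}^T = \mathbf{I}. \tag{13} $$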


There are many methods to enforce the above orthogonal constraint in the optimization. For
example, we may periodically apply the Gram-Schmidt process to orthogonalize
$\mathbf{U}$ during the optimization process. In this work, for computational efficiency, we follow [^ to
cast the orthogonal constraint in eq.(13) as a penalty term in the objective function,
which converts the above constrained optimization problem into an unconstrained one:

$$ \{\mathbf{U}^*, \Theta^*, \sigma^*\} = \arg\max_{\mathbf{U}, \Theta, \sigma} \Big[ \mathcal{L}(\mathbf{U}, \Theta, \sigma \mid \mathbf{X}) - \beta \cdot \mathcal{D}(\mathbf{U}) \Big] \tag{14} $$

where $\beta$ ($\beta > 0$) is a control parameter to balance the contribution of the penalty term, and the
penalty term $\mathcal{D}(\mathbf{U})$ is a differentiable function:

$$ \mathcal{D}(\mathbf{U}) = \sum_{i=1}^{M} \sum_{j=i+1}^{M} \frac{|\mathbf{u}_i \cdot \mathbf{u}_j|}{|\mathbf{u}_i| \cdot |\mathbf{u}_j|} \tag{15} $$

with $\mathbf{u}_i$ denoting the i-th row vector of the projection matrix $\mathbf{U}$, and $\mathbf{u}_i \cdot \mathbf{u}_j$ representing the
inner product of $\mathbf{u}_i$ and $\mathbf{u}_j$. The norms of all row vectors of $\mathbf{U}$ need to be normalized to one
after each update.

In this work, we propose to use the stochastic gradient descent (SGD) method to optimize
the objective function in eq.(14). In this case, given any training sample or a mini-batch of samples,
we calculate the gradients of the objective function with respect to the projection matrix, $\mathbf{U}$, and
the parameters of the mixture model, $\Theta$, and then update them iteratively until the objective
function converges. The gradients of $\mathcal{L}_1(\mathbf{U}, \Theta)$ depend on the mixture model to be used. In
the following, we first consider how to compute the derivatives of $\mathcal{D}(\mathbf{U})$ and $\mathcal{L}_2(\mathbf{U}, \sigma)$, which are
general for all HOPE models. After that, as two examples, we show how to calculate the
derivatives of $\mathcal{L}_1(\mathbf{U}, \Theta)$ for GMMs and movMFs.


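4.1 Dealing with the penalty term $\mathcal{D}(\mathbf{U})$

Following [^, the gradients of the penalty term $\mathcal{D}(\mathbf{U})$ with respect to each row vector, $\mathbf{u}_i$
($i = 1, \cdots, M$), can be easily derived as follows:

$$ \frac{\partial \mathcal{D}(\mathbf{U})}{\partial \mathbf{u}_i} = \sum_{j=1}^{M} g_{ij} \left( \frac{\mathbf{u}_j}{\mathbf{u}_i \cdot \mathbf{u}_j} - \frac{\mathbf{u}_i}{\mathbf{u}_i \cdot \mathbf{u}_i} \right) \tag{16} $$

where $g_{ij}$ denotes the absolute cosine value of the angle between two row vectors, $\mathbf{u}_i$ and $\mathbf{u}_j$,
computed as follows:

$$ g_{ij} = \frac{|\mathbf{u}_i \cdot \mathbf{u}_j|}{|\mathbf{u}_i| \cdot |\mathbf{u}_j|}. \tag{17} $$

The above derivatives can be equivalently represented in the following matrix form:

$$ \frac{\partial \mathcal{D}(\mathbf{U})}{\partial \mathbf{U}} = (\mathbf{D} - \mathbf{B})\, \mathbf{U} \tag{18} $$

where $\mathbf{D}$ is an $M \times M$ matrix, with its elements computed as
$d_{ij} = \frac{g_{ij}}{\mathbf{u}_i \cdot \mathbf{u}_j}$ ($1 \le i, j \le M$),
and $\mathbf{B}$ is an $M \times M$ diagonal matrix, with its diagonal elements computed as
$b_{ii} = \frac{\sum_{j=1}^{M} g_{ij}}{\mathbf{u}_i \cdot \mathbf{u}_i}$ ($1 \le i \le M$).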
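The penalty of eq. (15) and its matrix-form gradient in eq. (18) can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, NumPy is an assumed dependency, and the gradient code assumes that no two rows of U are exactly orthogonal (otherwise $d_{ij}$ involves a division by zero and the absolute value is not differentiable).

```python
import numpy as np

def penalty(U):
    """D(U): sum over row pairs i < j of the absolute cosine between u_i and u_j, eq. (15)."""
    norms = np.linalg.norm(U, axis=1)
    G = np.abs(U @ U.T) / np.outer(norms, norms)    # g_ij, eq. (17)
    return np.triu(G, k=1).sum()

def penalty_grad(U):
    """dD/dU = (D - B) U, eq. (18); assumes u_i . u_j != 0 for all pairs."""
    S = U @ U.T                                     # inner products u_i . u_j
    norms = np.sqrt(np.diag(S))
    G = np.abs(S) / np.outer(norms, norms)          # g_ij
    D_mat = G / S                                   # d_ij = g_ij / (u_i . u_j)
    B = np.diag(G.sum(axis=1) / np.diag(S))         # b_ii = sum_j g_ij / (u_i . u_i)
    return (D_mat - B) @ U
```

The diagonal entries $d_{ii}$ and the $j = i$ part of $b_{ii}$ cancel, so including $j = i$ in the sums is harmless. A finite-difference comparison on a random matrix is a simple way to confirm that the two functions agree.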


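4.2 Dealing with the noise model term $\mathcal{L}_2$

The log-likelihood function related to the noise model, $\mathcal{L}_2(\mathbf{U}, \sigma)$, can be expressed (omitting the
constant term that depends on neither $\mathbf{U}$ nor $\sigma$) as:

$$ \mathcal{L}_2(\mathbf{U}, \sigma) = -\frac{N(D-M)}{2} \ln(\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^{N} \mathbf{n}_n^T \mathbf{n}_n. \tag{19} $$

And we have:

$$
\begin{aligned}
\mathbf{n}_n^T \mathbf{n}_n &= (\mathbf{x}_n - \mathbf{U}^T \mathbf{U} \mathbf{x}_n)^T (\mathbf{x}_n - \mathbf{U}^T \mathbf{U} \mathbf{x}_n) \\
&= \mathbf{x}_n^T \mathbf{x}_n - 2\, \mathbf{x}_n^T \mathbf{U}^T \mathbf{U} \mathbf{x}_n + \mathbf{x}_n^T \mathbf{U}^T \mathbf{U} \mathbf{U}^T \mathbf{U} \mathbf{x}_n
\end{aligned}
\tag{20}
$$

Therefore, $\mathcal{L}_2(\mathbf{U}, \sigma)$ can be expressed as:

$$ \mathcal{L}_2(\mathbf{U}, \sigma) = -\frac{N(D-M)}{2} \ln(\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^{N} \Big[ \mathbf{x}_n^T \mathbf{x}_n - 2\, \mathbf{x}_n^T \mathbf{U}^T \mathbf{U} \mathbf{x}_n + \mathbf{x}_n^T \mathbf{U}^T \mathbf{U} \mathbf{U}^T \mathbf{U} \mathbf{x}_n \Big] \tag{21} $$

The gradient with respect to $\mathbf{U}$ can be derived as follows:

$$
\begin{aligned}
\frac{\partial \mathcal{L}_2(\mathbf{U}, \sigma)}{\partial \mathbf{U}} &= \frac{1}{\sigma^2} \sum_{n=1}^{N} \Big[ 2\, \mathbf{U} \mathbf{x}_n \mathbf{x}_n^T - \mathbf{U} \mathbf{U}^T \mathbf{U} \mathbf{x}_n \mathbf{x}_n^T - \mathbf{U} \mathbf{x}_n \mathbf{x}_n^T \mathbf{U}^T \mathbf{U} \Big] \\
&= \frac{1}{\sigma^2} \sum_{n=1}^{N} \mathbf{U} \Big[ \mathbf{x}_n (\bar{\mathbf{x}}_n)^T + \bar{\mathbf{x}}_n (\mathbf{x}_n)^T \Big]
\end{aligned}
\tag{22}
$$

For the noise variance $\sigma^2$, we can easily derive the following closed-form update formula by
setting its derivative to zero:

$$ \sigma^2 = \frac{1}{N(D-M)} \sum_{n=1}^{N} \mathbf{n}_n^T \mathbf{n}_n \tag{23} $$

As long as the learned noise variance $\sigma^2$ is small enough, maximizing the above term $\mathcal{L}_2$
will force all signal dimensions into the projection matrix $\mathbf{U}$, and only the residual noises will be
modelled by $\mathcal{L}_2$.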
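The noise term can be sketched in a few lines. The code below is an illustrative assumption-laden example, not the authors' code: samples are stored as rows of a matrix X, the function names are hypothetical, and the additive constant of the Gaussian log-density is dropped as in eq. (19). No orthonormality of U is assumed, in line with eq. (22).

```python
import numpy as np

def residual(U, X):
    """Rows are the residuals (I - U^T U) x_n, i.e. the noise components of eq. (5)."""
    return X - X @ U.T @ U

def noise_loglik(U, X, sigma2):
    """L2(U, sigma) of eq. (21), without the constant term; X is N x D, U is M x D."""
    N, D = X.shape
    M = U.shape[0]
    R = residual(U, X)
    return -0.5 * N * (D - M) * np.log(sigma2) - np.sum(R * R) / (2.0 * sigma2)

def noise_grad_U(U, X, sigma2):
    """Gradient of L2 w.r.t. U, second line of eq. (22)."""
    R = residual(U, X)
    # sum_n [x_n xbar_n^T + xbar_n x_n^T] equals X^T R + R^T X
    return U @ (X.T @ R + R.T @ X) / sigma2

def noise_variance(U, X):
    """Closed-form update of sigma^2, eq. (23)."""
    N, D = X.shape
    M = U.shape[0]
    R = residual(U, X)
    return np.sum(R * R) / (N * (D - M))
```

Because the expansion in eq. (20) holds for any U, the gradient can be verified by finite differences on a matrix that is deliberately not orthonormal.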


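4.3 Computing $\mathcal{L}_1$ for GMMs

In this section, we consider how to compute the partial derivatives of $\mathcal{L}_1(\mathbf{U}, \Theta)$ for GMMs.
Assume each mini-batch $\mathbf{X}$ consists of a small subset of randomly selected training samples,
$\mathbf{X} = \{\mathbf{x}_n \mid n = 1, \cdots, N\}$. The log-likelihood function of HOPE models with GMMs can be
represented as follows:

$$ \mathcal{L}_1(\mathbf{U}, \Theta) = \sum_{n=1}^{N} \ln \left[ \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(\mathbf{U}\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right] \tag{24} $$

The partial derivative of $\mathcal{L}_1(\mathbf{U}, \Theta)$ w.r.t. the mean vector, $\boldsymbol{\mu}_k$, of the k-th Gaussian component
can be calculated as follows:

$$ \frac{\partial \mathcal{L}_1(\mathbf{U}, \Theta)}{\partial \boldsymbol{\mu}_k} = \sum_{n=1}^{N} \gamma_k(\mathbf{z}_n) \cdot \boldsymbol{\Sigma}_k^{-1} (\mathbf{z}_n - \boldsymbol{\mu}_k) \tag{25} $$

where $\mathbf{z}_n = \mathbf{U}\mathbf{x}_n$, and $\gamma_k(\mathbf{z}_n)$ denotes the so-called occupancy statistics of the k-th Gaussian
component, computed as
$\gamma_k(\mathbf{z}_n) = \frac{\pi_k\, \mathcal{N}(\mathbf{z}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{z}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$.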

^We may use the constraint $\mathbf{U}\mathbf{U}^T = \mathbf{I}$ to significantly simplify the derivation of the gradient of $\mathcal{L}_2$ above. However, that leads to
a gradient computation that relies strongly on the orthogonal constraint. Since we use SGD to iteratively optimize
all model parameters, including $\mathbf{U}$, we cannot ensure that $\mathbf{U}\mathbf{U}^T = \mathbf{I}$ strictly holds at all times during the SGD process.
Therefore, the simplified gradient usually yields poor convergence performance.
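The partial derivative of $\mathcal{L}_1(\mathbf{U}, \Theta)$ w.r.t. $\pi_k$ can be simply derived as follows:

$$ \frac{\partial \mathcal{L}_1(\mathbf{U}, \Theta)}{\partial \pi_k} = \sum_{n=1}^{N} \frac{\gamma_k(\mathbf{z}_n)}{\pi_k} \tag{26} $$

The partial derivative of $\mathcal{L}_1(\mathbf{U}, \Theta)$ w.r.t. the covariance matrix $\boldsymbol{\Sigma}_k$ is computed as follows:

$$ \frac{\partial \mathcal{L}_1(\mathbf{U}, \Theta)}{\partial \boldsymbol{\Sigma}_k} = -\frac{1}{2} \sum_{n=1}^{N} \gamma_k(\mathbf{z}_n) \Big[ \boldsymbol{\Sigma}_k^{-1} - \boldsymbol{\Sigma}_k^{-1} (\mathbf{z}_n - \boldsymbol{\mu}_k)(\mathbf{z}_n - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} \Big] \tag{27} $$

When we use the above gradients to update Gaussian covariance matrices in SGD, we have
to impose additional constraints to ensure all covariance matrices are positive semidefinite.
However, if we adopt diagonal covariance matrices for all Gaussian components, these constraints
can be implemented in a fairly simple way.

Finally, the partial derivative of $\mathcal{L}_1(\mathbf{U}, \Theta)$ w.r.t. the projection matrix $\mathbf{U}$ is computed as:

$$ \frac{\partial \mathcal{L}_1(\mathbf{U}, \Theta)}{\partial \mathbf{U}} = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_k(\mathbf{z}_n) \cdot \boldsymbol{\Sigma}_k^{-1} (\boldsymbol{\mu}_k - \mathbf{z}_n)\, \mathbf{x}_n^T. \tag{28} $$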
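The occupancy statistics and the gradient in eq. (28) can be sketched for the diagonal-covariance case. This is a minimal illustration under stated assumptions, not the authors' implementation: samples are rows of X, `var` holds the diagonal of each covariance matrix as a K x M array, and all function names are hypothetical.

```python
import numpy as np

def gmm_log_components(Z, pi, mu, var):
    """ln(pi_k N(z_n | mu_k, diag(var_k))) for all n, k; returns an N x K array."""
    diff = Z[:, None, :] - mu[None, :, :]                   # N x K x M
    quad = np.sum(diff ** 2 / var[None], axis=2)            # Mahalanobis terms
    log_det = np.sum(np.log(2.0 * np.pi * var), axis=1)     # ln((2 pi)^M |Sigma_k|)
    return np.log(pi)[None] - 0.5 * (quad + log_det[None])

def gmm_loglik(U, X, pi, mu, var):
    """L1(U, Theta) of eq. (24), evaluated with a stable log-sum-exp."""
    logp = gmm_log_components(X @ U.T, pi, mu, var)
    top = logp.max(axis=1, keepdims=True)
    return float(np.sum(top[:, 0] + np.log(np.exp(logp - top).sum(axis=1))))

def gmm_grad_U(U, X, pi, mu, var):
    """dL1/dU of eq. (28) with diagonal covariance matrices."""
    Z = X @ U.T
    logp = gmm_log_components(Z, pi, mu, var)
    gamma = np.exp(logp - logp.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)               # occupancies gamma_k(z_n)
    # Per sample: sum_k gamma_k(z_n) * Sigma_k^{-1} (mu_k - z_n)
    w = np.einsum('nk,nkm->nm', gamma, (mu[None] - Z[:, None, :]) / var[None])
    return w.T @ X                                          # M x D
```

With diagonal covariances, keeping every variance strictly positive is enough to satisfy the constraint mentioned above; how that is enforced during SGD is left open here.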


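4.4 Computing $\mathcal{L}_1$ for movMFs

Similarly, we derive all partial derivatives of $\mathcal{L}_1(\mathbf{U}, \Theta)$ for a mixture of vMFs (movMFs). In
this case, given a mini-batch of training samples, $\mathbf{X} = \{\mathbf{x}_n \mid n = 1, \cdots, N\}$, the log-likelihood
function of the HOPE model with movMFs can be expressed as follows:

$$ \mathcal{L}_1(\mathbf{U}, \Theta) = \sum_{n=1}^{N} \ln \left[ \sum_{k=1}^{K} \pi_k \cdot C_M(|\boldsymbol{\mu}_k|) \cdot e^{\tilde{\mathbf{z}}_n \cdot \boldsymbol{\mu}_k} \right] \tag{29} $$

where each $\mathbf{z}_n$ must be normalized to be of unit length, as required by the vMF distribution:

$$ \tilde{\mathbf{z}}_n = \frac{\mathbf{z}_n}{|\mathbf{z}_n|}. \tag{30} $$

Similar to the HOPE models with GMMs, we first define an occupancy statistic for the k-th
vMF component as:

$$ \gamma_k(\tilde{\mathbf{z}}_n) = \frac{\pi_k \cdot C_M(|\boldsymbol{\mu}_k|) \cdot e^{\tilde{\mathbf{z}}_n \cdot \boldsymbol{\mu}_k}}{\sum_{j=1}^{K} \pi_j \cdot C_M(|\boldsymbol{\mu}_j|) \cdot e^{\tilde{\mathbf{z}}_n \cdot \boldsymbol{\mu}_j}} \tag{31} $$

In a similar way, we can derive the partial derivatives of $\mathcal{L}_1(\mathbf{U}, \Theta)$ with respect to $\pi_k$, $\boldsymbol{\mu}_k$
and $\mathbf{U}$ as follows:

$$ \frac{\partial \mathcal{L}_1(\mathbf{U}, \Theta)}{\partial \pi_k} = \sum_{n=1}^{N} \frac{\gamma_k(\tilde{\mathbf{z}}_n)}{\pi_k} \tag{32} $$

^In practice, we usually normalize all original data, $\mathbf{x}_n$, to be of unit length, $|\mathbf{x}_n| = 1$, prior to the HOPE
model. In this case, as long as M is properly chosen (e.g., to be large enough), the projection matrix $\mathbf{U}$ is
always learned to extract from $\mathbf{x}_n$ as much energy as possible. Therefore, this normalization step may be skipped
because the norm of the projected $\mathbf{z}_n$ is always very close to one even without normalization in this stage, i.e.,
$|\mathbf{z}_n| \approx |\tilde{\mathbf{z}}_n| = 1$.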
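Algorithm 1 SGD-based Maximum Likelihood Learning Algorithm for HOPE
  randomly initialize $\mathbf{u}_i$ ($i = 1, \cdots, M$), and $\theta_k$ ($k = 1, \cdots, K$)
  for epoch = 1 to T do
    for minibatch X in training set do
      $\mathbf{U} \leftarrow \mathbf{U} + \epsilon \cdot \left( \frac{\partial \mathcal{L}_1(\mathbf{U}, \Theta)}{\partial \mathbf{U}} + \frac{\partial \mathcal{L}_2(\mathbf{U}, \sigma)}{\partial \mathbf{U}} - \beta \cdot \frac{\partial \mathcal{D}(\mathbf{U})}{\partial \mathbf{U}} \right)$
      $\theta_k \leftarrow \theta_k + \epsilon \cdot \frac{\partial \mathcal{L}_1(\mathbf{U}, \Theta)}{\partial \theta_k}$ ($\forall k$)
      $\pi_k \leftarrow \pi_k + \epsilon \cdot \frac{\partial \mathcal{L}_1(\mathbf{U}, \Theta)}{\partial \pi_k}$ ($\forall k$)
      $\sigma^2 \leftarrow \frac{1}{N(D-M)} \sum_{n=1}^{N} \mathbf{n}_n^T \mathbf{n}_n$
      $\pi_k \leftarrow \frac{\pi_k}{\sum_j \pi_j}$ ($\forall k$) and $\mathbf{u}_i \leftarrow \frac{\mathbf{u}_i}{|\mathbf{u}_i|}$ ($\forall i$)
    end for
  end for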

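$$ \frac{\partial \mathcal{L}_1(\mathbf{U}, \Theta)}{\partial \boldsymbol{\mu}_k} = \sum_{n=1}^{N} \gamma_k(\tilde{\mathbf{z}}_n) \cdot \left[ \tilde{\mathbf{z}}_n - \frac{\boldsymbol{\mu}_k}{|\boldsymbol{\mu}_k|} \cdot \frac{I_{M/2}(|\boldsymbol{\mu}_k|)}{I_{M/2-1}(|\boldsymbol{\mu}_k|)} \right] \tag{33} $$

$$ \frac{\partial \mathcal{L}_1(\mathbf{U}, \Theta)}{\partial \mathbf{U}} = \sum_{n=1}^{N} \sum_{k=1}^{K} \frac{\gamma_k(\tilde{\mathbf{z}}_n)}{|\mathbf{z}_n|} \cdot (\mathbf{I} - \tilde{\mathbf{z}}_n \tilde{\mathbf{z}}_n^T)\, \boldsymbol{\mu}_k \mathbf{x}_n^T \tag{34} $$

Refer to the Appendix for all details on how to derive the above derivatives for the movMF
distributions. Moreover, when movMFs are used, we need some special treatments to compute
the Bessel functions in vMFs, i.e., $I_v(\cdot)$, as shown in eqs.(31) and (33). In this work, we adopt
the numerical method in [^ to approximate the Bessel functions; refer to the Appendix for
the numerical details.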


4.5 The SGD-based Learning Algorithm

All mixture weights, $\pi_k$ ($k = 1, \cdots, K$), and all row vectors, $\mathbf{u}_i$ ($i = 1, \cdots, M$), of the
projection matrix must satisfy the constraints $\sum_{k=1}^{K} \pi_k = 1$ and $|\mathbf{u}_i| = 1$ ($\forall i$). Therefore, during the SGD
learning process, $\pi_k$ and $\mathbf{u}_i$ must be normalized after each update as follows:

$$ \pi_k \leftarrow \frac{\pi_k}{\sum_{j=1}^{K} \pi_j} \tag{35} $$

$$ \mathbf{u}_i \leftarrow \frac{\mathbf{u}_i}{|\mathbf{u}_i|} \tag{36} $$

Finally, we summarize the SGD algorithm to learn the HOPE models based on the maximum
likelihood (ML) criterion in Algorithm 1.
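The normalization in eqs. (35) and (36) amounts to two rescaling operations applied after every SGD update. The sketch below is illustrative only: the function name `renormalize` is hypothetical, and it assumes that all mixture weights remain positive after the gradient step, a case the text above does not discuss.

```python
import numpy as np

def renormalize(pi, U):
    """Re-impose sum_k pi_k = 1 (eq. 35) and |u_i| = 1 for every row (eq. 36)."""
    pi = pi / pi.sum()                                   # assumes all pi_k > 0
    U = U / np.linalg.norm(U, axis=1, keepdims=True)     # unit-length rows
    return pi, U
```

Note that this step only fixes the row norms; the mutual orthogonality of the rows is encouraged separately by the penalty term in eq. (14).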


5 Learning Neural Networks as HOPE 

As described above, the HOPE model may be used as a novel model for high-dimensional data.
The HOPE model itself can be efficiently learned unsupervised from unlabelled data based on
the above-mentioned maximum likelihood criterion. Moreover, if data labels are available, a
variety of discriminative training methods, such as those in [15-17], may be used to learn the
HOPE model in a supervised way based on some other discriminative learning criteria.

More interestingly, as we will elucidate here, there exists a strong relationship between the
HOPE models and neural networks (NNs). First of all, we will show that the HOPE models
may be used as a new tool to probe why NNs work so well in practice. Under the
new HOPE framework, we may explain why NNs can almost universally excel on a variety of data
types and how NNs can handle various types of highly-correlated high-dimensional data, which
may be quite challenging for many other machine learning models. Secondly, and more importantly,
the HOPE framework provides us with some new approaches to learn NNs: (i) Unsupervised
learning: the maximum likelihood estimation of HOPE may be directly applied to learn NNs
from unlabelled data; (ii) Supervised learning: the HOPE framework can be incorporated into
the normal supervised learning of NNs by explicitly imposing some orthogonal constraints in
learning. This may improve the learning of NNs and yield better and more compact models.


5.1 Linking HOPE to Neural Networks




Figure 1: Illustration of a HOPE model as a layered network structure in (a). It may be
equivalently reformulated as a hidden layer in neural nets shown in (b).


A HOPE model normally consists of two stages: i) a linear orthogonal projection from the 
raw data space to the latent feature space; ii) a generative model defined as a hnite mixture 
model in the latent feature. As a result, we may depict every HOPE model as a two-layer 
network: a linear projection layer and a nonlinear model layer, as shown in Eigure [H (a). The 
first layer represents the linear orthogonal projection from x (x G ) to z (z G z = Ux. 

The second layer represents the underlying finite mixture model in the feature space. Taking 
movMFs as an example, each node in the model layer represents the log-likelihood contribution 
from one mixture component as follows: 

(/)fc = ln(7rfc • fk{z\Ok)) = InvTfc -h InCuHtJ-kl) + z • /.tfc. (37) 

Given an input $x$ (assuming $x$ is projected to $z$ in the latent feature space), if we know all
$\phi_k$ ($1 \le k \le K$) in the model layer, we can easily compute the log-likelihood value of $x$ from the
HOPE model in eq. (?) as follows:

$$\ln p(z) = \ln \sum_{k=1}^{K} \exp(\phi_k). \qquad (38)$$
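As a rough illustration of eqs. (37) and (38), the following Python sketch evaluates the component scores and the total log-likelihood for a toy movMF. It assumes a three-dimensional latent space (M = 3) only because the vMF normalizer then has the closed form C_3(kappa) = kappa / (4 pi sinh(kappa)); the function names and the toy parameters are hypothetical and are not taken from the paper.

```python
import math

def log_vmf_norm_3d(kappa):
    """ln C_3(kappa) for a vMF density on the unit sphere in R^3 (assumed M = 3)."""
    # ln sinh(k) = k + ln(1 - exp(-2k)) - ln 2, which avoids overflow for large k
    log_sinh = kappa + math.log1p(-math.exp(-2.0 * kappa)) - math.log(2.0)
    return math.log(kappa) - math.log(4.0 * math.pi) - log_sinh

def component_scores(z, weights, means):
    """phi_k = ln pi_k + ln C_M(|mu_k|) + z . mu_k for every component (eq. 37)."""
    scores = []
    for pi_k, mu_k in zip(weights, means):
        kappa = math.sqrt(sum(m * m for m in mu_k))  # concentration is |mu_k|
        dot = sum(zi * mi for zi, mi in zip(z, mu_k))
        scores.append(math.log(pi_k) + log_vmf_norm_3d(kappa) + dot)
    return scores

def log_sum_exp(values):
    """Numerically stable ln sum_k exp(v_k) (eq. 38)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

# Toy mixture with two components (illustrative values only)
weights = [0.7, 0.3]
means = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
z = [1.0, 0.0, 0.0]  # unit-length projection of some input x

phi = component_scores(z, weights, means)
log_likelihood = log_sum_exp(phi)
```

Subtracting the largest score before exponentiating is a standard precaution; it does not change the value of eq. (38).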


Moreover, all $\phi_k$ ($1 \le k \le K$) may be used as a set of distance measurements to locate the
input projection, $z$, in the latent space by trilateration, as shown in Figure 2. Therefore, all
$\phi_k$ ($1 \le k \le K$) may be viewed as a set of features to distinctly represent the original input $x$.



Figure 2: Illustration of the HOPE features as trilateration in the latent feature space. 


Furthermore, we may use a preset threshold $\varepsilon$ to prune the raw measurements, $\phi_k$ ($1 \le k \le K$),
as follows:

$$\eta_k = \max(0, \phi_k - \varepsilon) \qquad (39)$$

to eliminate those small log-likelihood values from some faraway mixture components. Pruning
small log-likelihood values as above may result in several benefits. Firstly, this pruning operation
does not affect the total likelihood from the mixture model because it is always dominated by
only a few top components. Therefore, we have:

$$\ln p(z) \approx \varepsilon + \ln \sum_{k=1}^{K} \exp(\eta_k).$$

Secondly, this pruning step does not affect the trilateration problem in Figure 2. This is similar
to the Global Positioning System (GPS), where the weak signals from faraway satellites are not
used for localization. More importantly, the above pruning operation may improve the robustness
of the features since these small log-likelihood values may become very noisy. Note that the
above pruning operation is similar to the rectified linear nonlinearity in regular ReLU neural
networks. Here we give a more intuitive explanation for the popular ReLU operation under the
HOPE framework.
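The pruning in eq. (39) and the approximation that follows it can be checked numerically. The sketch below uses made-up scores and a made-up threshold (both are assumptions for illustration) and compares the exact log-likelihood with the value recovered from the pruned measurements.

```python
import math

def prune(phi, eps):
    """eta_k = max(0, phi_k - eps): keep only scores above the threshold (eq. 39)."""
    return [max(0.0, p - eps) for p in phi]

def log_sum_exp(values):
    """Numerically stable ln sum_k exp(v_k)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

phi = [5.0, 4.0, -20.0, -30.0]  # two nearby components, two faraway ones
eps = 1.0                        # hypothetical threshold

eta = prune(phi, eps)            # [4.0, 3.0, 0.0, 0.0]
exact = log_sum_exp(phi)         # ln p(z) from all components
approx = eps + log_sum_exp(eta)  # ln p(z) recovered from pruned values
```

Each pruned component still contributes exp(0) = 1 to the sum, so the two values agree closely only when the surviving scores sit well above the threshold, which is the regime the text describes.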

In this way, as shown in Figure 1 (a), all rectified log-likelihood values in the model layer,
i.e., $\eta_k$ ($1 \le k \le K$), may be viewed as a sensory map to measure each input, $x$, in the latent
feature space using all mixture components as the probers. Each pixel in the map, i.e., a pruned
measurement $\eta_k$, roughly tells the distance between the centroid of a mixture component and
the input projection $z$ in the latent feature space. Under some condition (e.g., $K \ge M$), the
input projection can be precisely located based on these pruned $\eta_k$ values as a trilateration
problem in the $M$-dimensional feature space, as shown in Figure 2. Therefore, this sensory map
may be viewed as a feature representation learned to represent the input $x$, which may be fed
to a softmax classifier to form a normal shallow neural network, or to another HOPE model to
form a deep neural network.

Moreover, since the projection layer is linear, it can be mathematically combined with the
upper model layer to generate a single layer structure, as shown in Figure 1 (b). If movMFs
are used in HOPE, it is equivalent to a hidden layer in normal rectified linear (ReLU) neural
networks. The weight matrix in the merged layer can be simply derived from the HOPE
model parameters, $U$ and $\Theta$. It is simple to show that the weight vector for each hidden node
$k$ ($1 \le k \le K$) in Figure 1 (b) may be derived as

$$w_k = U^{\top} \mu_k$$

and its bias is computed as

$$b_k = \ln \pi_k + \ln C_M(|\mu_k|) - \varepsilon.$$
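The merge of the two layers can be verified on a small example. The sketch below assumes that the merged weight vector is w_k = U^T mu_k (which follows from z . mu_k = (U x) . mu_k) and treats the values of ln C_M(|mu_k|) as precomputed placeholders; all names and numbers are hypothetical.

```python
import math

def hope_two_layer(x, U, mus, log_pis, log_cs, eps):
    """Projection layer z = U x followed by the rectified model layer."""
    z = [sum(u * v for u, v in zip(row, x)) for row in U]
    return [max(0.0, lp + lc + sum(zi * mi for zi, mi in zip(z, mu)) - eps)
            for lp, lc, mu in zip(log_pis, log_cs, mus)]

def merge(U, mus, log_pis, log_cs, eps):
    """Collapse both layers: w_k = U^T mu_k, b_k = ln pi_k + ln C_M(|mu_k|) - eps."""
    D = len(U[0])
    W = [[sum(mu[i] * U[i][j] for i in range(len(U))) for j in range(D)] for mu in mus]
    b = [lp + lc - eps for lp, lc in zip(log_pis, log_cs)]
    return W, b

def relu_layer(x, W, b):
    """Ordinary ReLU hidden layer: max(0, w_k . x + b_k)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + bk) for row, bk in zip(W, b)]

# Toy sizes: D = 3 inputs, M = 2 latent dimensions, K = 3 components
U = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]        # orthonormal rows
mus = [[2.0, 0.0], [0.0, 3.0], [-1.0, -1.0]]
log_pis = [math.log(0.5), math.log(0.3), math.log(0.2)]
log_cs = [-1.0, -1.5, -0.5]                   # placeholder values for ln C_M(|mu_k|)
eps = -2.0                                    # hypothetical threshold
x = [0.6, 0.8, 0.0]

two_layer_out = hope_two_layer(x, U, mus, log_pis, log_cs, eps)
W, b = merge(U, mus, log_pis, log_cs, eps)
merged_out = relu_layer(x, W, b)              # matches two_layer_out up to rounding
```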


Although the linear projection layer may be merged with the model layer after all model
parameters are learned, it may be beneficial to keep them separate during the model
learning process. In this way, the model capacity may be controlled by two distinct control
parameters: i) $M$ can be selected properly to filter out noise components as in eq. (?) to prevent
overfitting in learning; ii) $K$ may be chosen independently to ensure the model is complex enough
to model very big data sets for more difficult tasks. Moreover, we may enforce the orthogonal
constraint, i.e., $UU^{\top} = I$, during the model learning to ensure that all dimensions of $z$ are
mostly de-correlated in the latent feature space, which may significantly simplify the density
estimation in the feature space using a finite mixture model.

The formulation in Figure 1 (a) helps to explain the underlying mechanism of how neural
networks work. Under the HOPE framework, it becomes clear that each hidden layer in neural
networks may actually perform two different tasks implicitly, namely feature extraction and data
modeling. This may shed some light on why neural nets can directly deal with various types
of highly-correlated high-dimensional data [23] without any explicit dimension reduction and
feature de-correlation steps.

Based on the above discussion, a HOPE model is mathematically equivalent to a hidden
layer in neural networks. For example, each movMF HOPE model can be collapsed into a single
weight matrix, the same as a hidden layer of ReLU NNs. Conversely, any weight matrix in
ReLU NNs may be decomposed as a product of two matrices as in Figure 1 (a) as long as
$M$ is not less than the rank of the weight matrix. In practice, we may deliberately choose a
smaller value for $M$ to regularize the models. Under this formulation, it is clear that neural
networks may be trained under the HOPE framework. There are several advantages to learning
neural networks under the HOPE framework. First of all, the modelling capacity of neural
networks may be explicitly controlled by selecting proper values for $M$ and $K$, each of which is
chosen for a different purpose. Secondly, we can easily apply the maximum likelihood estimation
of HOPE models in section 4 to unsupervised or semi-supervised learning to learn NNs from
unlabelled data. Thirdly, the useful orthogonal constraints may be incorporated into the normal
back-propagation process to learn better NN models in supervised learning as well.

Footnote: On the other hand, if GMMs are used in HOPE, it can be similarly shown that it is equivalent to a hidden
layer in radial basis function (RBF) networks.
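The statement that a weight matrix can be split into a projection layer and a model layer whenever M is at least its rank can be illustrated with Gram-Schmidt orthogonalization. This is only one possible way to obtain such a factorization, chosen here because it needs no external libraries; it is not the training procedure of the paper, and the matrix W below is a made-up example.

```python
import math

def orthonormal_row_basis(W, tol=1e-10):
    """Gram-Schmidt over the rows of W: returns U whose rows are orthonormal
    and span the row space of W, so U has rank(W) rows."""
    basis = []
    for row in W:
        r = list(row)
        for u in basis:
            coeff = sum(a * b for a, b in zip(r, u))
            r = [a - coeff * b for a, b in zip(r, u)]
        norm = math.sqrt(sum(a * a for a in r))
        if norm > tol:  # skip rows already in the span of the basis
            basis.append([a / norm for a in r])
    return basis

def factorize(W):
    """Split W (K x D) into V (K x M) and U (M x D) with U U^T = I and W = V U."""
    U = orthonormal_row_basis(W)
    V = [[sum(a * b for a, b in zip(w, u)) for u in U] for w in W]
    return V, U

# K = 4 hidden nodes, D = 3 inputs, rank 2 (row 2 = 2 * row 1, row 4 = row 1 + row 3)
W = [[1.0, 2.0, 0.0],
     [2.0, 4.0, 0.0],
     [0.0, 1.0, 1.0],
     [1.0, 3.0, 1.0]]
V, U = factorize(W)  # here M = len(U) = 2
```

Choosing M smaller than the rank would make the product V U only an approximation of W, which is the regularizing case mentioned in the text.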


5.2 Unsupervised Learning of Neural Networks as HOPE 

The maximum likelihood estimation method for HOPE in section 4 can be used to learn neural
networks layer by layer in an unsupervised learning mode. All HOPE model parameters in
Figure 1 (a) are first estimated based on the maximum likelihood criterion as in section 4. Next,
the two layers in the HOPE model are merged to form a regular NN hidden layer as in Figure 1
(b). In this case, class labels are not required to learn all network weights and neural networks
can be learned from unlabelled data under a theoretically solid framework. This is similar to
Hebbian-style learning [25], but it has a well-founded and converging objective function in
learning. As described above, the rectified log-likelihood values from the HOPE model, i.e.,
$\eta_k$ ($1 \le k \le K$), may be viewed as a sensory map using all mixture components as the probers
in the latent feature space, which may serve as a good feature representation for the original
data. In the end, a small amount of labelled data may be used to learn a simple classifier, either
a softmax layer or a linear support vector machine (SVM), on top of the HOPE layers, which
takes the sensory map as input for final classification or prediction.

In unsupervised learning, the learned orthogonal projection matrix U may be viewed as a 
generalized PCA, which performs dimension reduction by considering the complex distribution 
modeled by a finite mixture model in the latent feature space. 


5.3 Supervised Learning of Neural Networks as HOPE 


The HOPE framework can also be applied to the supervised learning of neural networks when
data labels are available. Taking ReLU neural networks as an example, each hidden layer in
a ReLU neural network, as shown in Figure 1 (b), may be viewed as a HOPE model and thus
it can be decomposed as a combination of a projection layer and a model layer, as shown in
Figure 1 (a). In this case, $M$ needs to be chosen properly to avoid overfitting. In other words,
each hidden layer in ReLU NNs is represented as two layers during learning, namely a linear
projection layer and a nonlinear model layer. If data labels are available, as in [16, 17], instead of
using the maximum likelihood criterion, we may use other discriminative training criteria [15] to
form the objective function to learn all network parameters. Taking the popular minimum
cross-entropy error criterion as an example, given a training set of the input data and the class
labels, i.e., $X = \{(x_t, l_t) \mid 1 \le t \le T\}$, we may use the HOPE outputs, i.e., all $\eta_k$ in eq. (39), to
form the cross-entropy objective function as follows:


$$F_{CE}(U, \mu_k, b_k \mid X) = -\sum_{t=1}^{T} \log \frac{\exp\big(\eta_{l_t}(x_t)\big)}{\sum_{k=1}^{K} \exp\big(\eta_k(x_t)\big)}. \qquad (40)$$
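For concreteness, the cross-entropy criterion in eq. (40) can be evaluated as in the following sketch. It assumes that each training sample comes with a list of output values (one per class) and an integer class label; the function name and the toy inputs used to exercise it are hypothetical.

```python
import math

def cross_entropy(outputs, labels):
    """-sum_t log( exp(eta_{l_t}(x_t)) / sum_k exp(eta_k(x_t)) ), as in eq. (40).

    outputs[t] holds the output values for sample t, labels[t] is its class index.
    The normalizer is computed with the log-sum-exp trick for numerical stability.
    """
    total = 0.0
    for eta, label in zip(outputs, labels):
        top = max(eta)
        log_norm = top + math.log(sum(math.exp(e - top) for e in eta))
        total -= eta[label] - log_norm
    return total

# Three classes with identical outputs give a loss of ln(3) for any label.
uniform_loss = cross_entropy([[0.0, 0.0, 0.0]], [1])
# A confident, correct output gives a loss close to zero.
confident_loss = cross_entropy([[10.0, 0.0, 0.0]], [0])
```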



HOPE (Zhang and Jiang) 




Obviously, the standard back-propagation algorithm may be used to optimize the above
objective function to learn all decomposed HOPE model parameters end-to-end, including $U$ and
$\{\mu_k, b_k \mid 1 \le k \le K\}$ from all HOPE layers. The only difference is that the orthogonal
constraints, as in eq. (13), must be imposed for all projection layers during training, where the
derivatives in eq. (18) must be incorporated in the standard back-propagation process to update
each projection matrix $U$ to ensure it is orthonormal. Note that in supervised learning based
on a discriminative training criterion, the unit-length normalization of the data projections in
eq. (30) can be relaxed for computational simplicity since this normalization is only important
for computing the likelihood for pure probabilistic models. After the learning, each pair of
projection and model layers can be merged into a single hidden layer. After merging, the
resultant network has exactly the same network structure as normal ReLU neural networks.
This learning method is related to the well-known low-rank matrix factorization method used
for training deep NNs in speech recognition [27, 31]. However, since we impose the orthogonal
constraints for all projection matrices during the training process, it may lead to more compact
models and/or better performance.

In supervised learning, the learned orthogonal projection matrix $U$ may be viewed as a
generalized LDA or HDA [19], which optimizes the data projection to maximize (or minimize)
the underlying discriminative learning criterion.


5.4 HOPE for Deep Learning 

As above, the HOPE framework can be used to learn rather strong shallow NN models. However,
this does not hinder HOPE from building deeper models for deep learning. As shown in Figure 3,
we may have two different structures to learn very deep neural networks under the HOPE
framework. In Figure 3 (a), one HOPE model is used as the first layer primarily for feature
extraction and a deep neural network is concatenated on top of it as a powerful classifier to form
a deep structure. The deep model in Figure 3 (a) may be learned in either supervised or
semi-supervised mode. In semi-supervised learning, the HOPE model is learned based on the
maximum likelihood estimation and the upper deep NN is learned supervised. Alternatively, if
we have enough labelled data, we may jointly learn both HOPE and DNN in a supervised mode.
In Figure 3 (b), we may even stack multiple HOPE models to form another deep model structure.
In this case, each HOPE model generates a sensory feature map in each HOPE layer. Just like all
pixels in a normal image, all thresholded log-likelihood values in the sensory feature map are also
highly correlated, especially for those mixture components located relatively close in the feature
space. Thus, it makes sense to add another HOPE model on top of it to de-correlate features and
perform data modeling at a finer granularity. The deep HOPE model structures in Figure 3 (b)
can also be learned in either a supervised or an unsupervised mode. In unsupervised learning, these
HOPE layers are learned layer-wise using the maximum likelihood estimation. In supervised
learning, all HOPE layers are learned in back-propagation with orthonormal constraints being
imposed on all projection layers.


Figure 3: Two structures to learn deep networks with HOPE: (a) Stacking a DNN on top of one
HOPE layer; (b) Stacking multiple HOPE layers.


6 Experiments

In this section, we will investigate the proposed HOPE framework in learning neural networks for
several standard image and speech recognition tasks under several different learning conditions:
i) unsupervised feature learning; ii) supervised learning; iii) semi-supervised learning. The
examined tasks include image recognition using the MNIST data set and speech
recognition using the TIMIT data set.

6.1 MNIST: Image Recognition 

The MNIST data set consists of 28 x 28 pixel greyscale images of handwritten digits 0-9,
with 60,000 training and 10,000 test examples. In our experiments, we first evaluate the
performance of unsupervised feature learning using the HOPE model with movMFs. Secondly,
we investigate the performance of supervised learning of DNNs under the HOPE framework,
and further study the effect of the orthogonal constraint in the HOPE framework. Finally, we
consider a semi-supervised learning scenario with the HOPE models, where all training samples
(without labels) are used to learn feature representation unsupervised and then a portion of
training data (along with labels) is used to learn post-stage classification models supervised.

6.1.1 Unsupervised Feature Learning on MNIST 

In this experiment, we first randomly extract many small patches from the original unlabelled
training images on MNIST. Each patch is 6-by-6 in dimension, represented as a vector in
$\mathbb{R}^D$ with $D = 36$. In this work, we randomly extract 400,000 patches in total from the MNIST
training set for unsupervised feature learning. Moreover, every patch is normalized by subtracting
the mean and dividing by the standard deviation of its elements.

In the unsupervised feature learning, we follow the same experimental setting as in [?], where
an unsupervised learning algorithm is used to learn a “black box” feature extractor to map each
input vector in $\mathbb{R}^D$ to another $K$-dimension feature vector. In this work, we have examined
several different unsupervised learning algorithms for feature learning: (i) kmeans clustering;
(ii) spherical kmeans (spkmeans) clustering; (iii) mixture of vMFs (movMF); (iv) PCA-based
dimension reduction plus movMF (PCA-movMF); and (v) the HOPE model with movMFs
(HOPE-movMF). As for kmeans and spkmeans, the only difference is that different distance
measures are used in clustering: kmeans uses the Euclidean distance while spkmeans uses the
cosine distance. As for the movMF model, we can use the expectation maximization (EM)
algorithm for estimation, as described in [?]. In the following, we briefly summarize the experimental
details for these feature extractors.

Footnote: Matlab codes are available at https://wiki.eecs.yorku.ca/lab/MLL/projects:hope:start for readers to
reproduce all MNIST results reported in this paper.

1. kmeans: We first apply the k-means clustering method to learn $K$ centroids from
all extracted patch input vectors. For each learned centroid $\mu_k$, we use a soft threshold
function to compute each feature as: $f_k(x) = \max(0, |x - \mu_k| - \varepsilon)$, where $\varepsilon$ is a pre-set
threshold. In this way, we may generate a $K$-dimension feature vector for each patch input
vector.

2. spkmeans: As for the spk-means clustering, we need to normalize all input patch vectors
to be of unit length before clustering them into $K$ different centroids based on the cosine
distance measure. Given each learned centroid $\mu_k$, we can compute each feature as
$f_k(x) = \max(0, x \cdot \mu_k - \varepsilon)$.

3. movMF: We also need to normalize all input patch vectors to be of unit length. We use
the EM algorithm to learn all model parameters $\Theta$. For each learned centroid $\mu_k$ of the
movMF model, we compute one feature as
$f_k(x) = \max(0, \ln(\pi_k) + \ln(C_D(|\mu_k|)) + x \cdot \mu_k - \varepsilon)$.

4. PCA-movMF: Compared to movMF, the only difference is that we first use PCA for
dimension reduction, reducing all input patch vectors from $\mathbb{R}^D$ to $\mathbb{R}^M$. Then, we use
the same method to estimate an movMF model for the reduced $M$-dimension feature
vectors. In this experiment, we set $M = 20$ to preserve 99.5% of the total sum of all
eigenvalues in PCA. For each learned vMF model $\mu_k$, we compute one feature as
$f_k(x) = \max(0, \ln(\pi_k) + \ln(C_M(|\mu_k|)) + z \cdot \mu_k - \varepsilon)$.

5. HOPE-movMF: We use the maximum likelihood estimation method described in section
4 to learn a HOPE model with movMFs. For each learned vMF component, we can
compute one feature as $f_k(x) = \max(0, \ln(\pi_k) + \ln(C_M(|\mu_k|)) + z \cdot \mu_k - \varepsilon)$. Similar to
PCA-movMF, we also set $M = 20$ here.
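As a minimal sketch of how such soft-thresholded features can be computed, the code below implements a cosine-similarity feature of the kind described for spkmeans, assuming the form f_k(x) = max(0, x . mu_k - eps) with a unit-length input and unit-length centroids. The function names, centroids and threshold are hypothetical.

```python
import math

def unit_normalize(x):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def spkmeans_features(x, centroids, eps):
    """One feature per centroid: max(0, x . mu_k - eps) after normalizing x."""
    x = unit_normalize(x)
    return [max(0.0, sum(a * b for a, b in zip(x, mu)) - eps) for mu in centroids]

centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # toy unit-length centroids
features = spkmeans_features([3.0, 4.0], centroids, eps=0.5)
# x normalizes to [0.6, 0.8]; the third centroid points away and is pruned to 0
```

The movMF variants differ only in the extra additive terms ln(pi_k) and the log-normalizer inside the max.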

Furthermore, since HOPE-movMF is learned using SGD, we need to tune some hyper-parameters
for HOPE, such as the learning rate, mini-batch size, $\beta$ and $\sigma^2$. In this work, the learning rate is
set to 0.002, the mini-batch size is set to 100, we set $\beta = 1.0$, and the noise variance is manually set
to $\sigma^2 = 0.1$ for convenience.

After learning the above models, they are used as feature extractors. We use the same
method as described in [?] to generate a feature representation for each MNIST image, where
each feature extractor is convolved over an MNIST image to obtain the feature representations
for all local patches in the image. Next, we split the image into four equally-sized quadrants
and the feature vectors in each quadrant are all summed up to pool as one feature vector.
In this way, we can get a $4K$-dimensional feature vector for each MNIST image, where $K$ is
the number of all learned features for each patch. Finally, we use these pooled $4K$-dimension
feature vectors for all training images, along with the labels, to estimate a simple linear SVM
as a post-stage classifier for image classification. The experimental results are shown in Table 1.

Table 1: Classification error rates (in %) on the MNIST test set using supervised learned linear
SVMs as classifiers and unsupervised learned features (4K-dimension) from different models.

model / K=      400    800    1200   1600
kmeans          1.41   1.31   1.16   1.13
spkmeans        1.09   0.90   0.86   0.81
movMF           0.89   0.82   0.81   0.84
PCA-movMF       0.87   0.75   0.73   0.74
HOPE-movMF      0.76   0.71   0.64   0.67


We can see that spkmeans and movMF can achieve much better performance than kmeans. The
PCA-based dimension reduction leads to further performance gain. Finally, the jointly trained
HOPE model with movMFs yields the best performance, e.g., 0.64% in classification error rate
when K = 1200. This is a very strong performance for unsupervised feature learning on MNIST.
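The quadrant pooling step used to build the 4K-dimension image features in this experiment can be sketched as follows. The sketch assumes that the K-dimension patch features are already arranged on a grid of patch positions (23 x 23 positions for 6 x 6 patches on a 28 x 28 image if a stride of 1 is used, which is an assumption here), and that odd grid sizes are split with the extra row and column going to the first half; the function name is hypothetical.

```python
def quadrant_sum_pool(feature_map):
    """feature_map[r][c] is the K-dim feature vector of the patch at grid position (r, c).
    Returns a 4K-dim vector: the summed features of the four image quadrants."""
    rows, cols = len(feature_map), len(feature_map[0])
    K = len(feature_map[0][0])
    half_r, half_c = (rows + 1) // 2, (cols + 1) // 2  # assumed split for odd sizes
    pooled = [[0.0] * K for _ in range(4)]
    for r in range(rows):
        for c in range(cols):
            q = 2 * (r >= half_r) + (c >= half_c)  # quadrant index 0..3
            for k in range(K):
                pooled[q][k] += feature_map[r][c][k]
    return [v for quadrant in pooled for v in quadrant]

# Toy example: a 23 x 23 grid of identical K = 2 feature vectors
feature_map = [[[1.0, 2.0] for _ in range(23)] for _ in range(23)]
image_feature = quadrant_sum_pool(feature_map)  # length 4 * K = 8
```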


6.1.2 Supervised Learning of Neural Networks as HOPE on MNIST 

In this experiment, we use the MNIST data set to examine the supervised learning of rectified
linear (ReLU) neural networks under the HOPE framework, as discussed in section 5.3.
Here we follow the normalized initialization in [10] to randomly initialize all NN weights,
without using pre-training. We adopt a small modification to the method in [10] by adding a
factor to control the dynamic range of initial weights as follows:

$$W_i \sim \mathcal{U}\left[-\gamma \cdot \frac{\sqrt{6}}{\sqrt{n_i + n_{i+1}}},\; \gamma \cdot \frac{\sqrt{6}}{\sqrt{n_i + n_{i+1}}}\right] \qquad (41)$$

where $n_i$ denotes the number of units in the $i$-th layer. For the ReLU units, due to the unbounded
behavior of the activation function, activations of ReLU units might grow unbounded. To handle
this numerical problem, we shrink the dynamic range of initial weights by using a small factor
$\gamma$ ($\gamma = 0.5$), which is equivalent to scaling the activations.
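A minimal sketch of this scaled initialization is given below. It assumes the usual normalized-initialization bound sqrt(6) / sqrt(n_in + n_out) multiplied by the factor gamma; the function name, the seeded generator and the layer sizes are illustrative choices, not details from the paper.

```python
import math
import random

def scaled_normalized_init(n_in, n_out, gamma=0.5, rng=random):
    """Draw an n_out x n_in weight matrix uniformly from [-b, b],
    with b = gamma * sqrt(6) / sqrt(n_in + n_out)."""
    bound = gamma * math.sqrt(6.0) / math.sqrt(n_in + n_out)
    return [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)  # seeded only to make the example reproducible
W_example = scaled_normalized_init(784, 100, gamma=0.5, rng=rng)
bound_example = 0.5 * math.sqrt(6.0) / math.sqrt(784 + 100)
```

With gamma = 1 this reduces to the unscaled normalized initialization; smaller values of gamma shrink the initial dynamic range as described above.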

We use SGD to train ReLU neural networks using the following learning schedule:

$$\epsilon_t = \epsilon_0 \cdot \alpha^t \qquad (42)$$

$$m_t = \begin{cases} \frac{t}{T} m_f + \left(1 - \frac{t}{T}\right) m_i & t < T \\ m_f & t \ge T \end{cases} \qquad (43)$$

where $\epsilon_t$ and $m_t$ denote the learning rate and momentum for the $t$-th epoch, and we set all
control parameters as $m_i = 0.5$, $m_f = 0.99$, $\alpha = 0.998$. We run a total of $T = 50$ epochs for
learning without dropout and $T = 500$ epochs for learning with dropout. Moreover,
weight decay is used here and it is set to 0.00001. Furthermore, for the HOPE model, the control
parameter for the orthogonal constraint, $\beta$, is set to 0.01 in all experiments. In this work, we
do not use any data augmentation method.
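The schedule in eqs. (42) and (43) is simple to compute per epoch, as the sketch below shows. The initial learning rate eps0 = 0.1 is a placeholder, since its value is not stated in this section; the function names are hypothetical.

```python
def learning_rate(t, eps0, alpha=0.998):
    """eps_t = eps0 * alpha**t: exponential decay per epoch (eq. 42)."""
    return eps0 * alpha ** t

def momentum(t, T, m_i=0.5, m_f=0.99):
    """Linear ramp from m_i to m_f over the first T epochs, then constant (eq. 43)."""
    if t < T:
        return (t / T) * m_f + (1.0 - t / T) * m_i
    return m_f

T = 50       # number of epochs without dropout, as in the text
eps0 = 0.1   # placeholder initial learning rate (not given in this section)
schedule = [(learning_rate(t, eps0), momentum(t, T)) for t in range(T + 1)]
```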

Under the HOPE framework, we decompose each ReLU hidden layer into two layers as
in Figure 1 (a). In this experiment, we first examine the supervised learning of NNs with or
without imposing the orthogonal constraints on all projection layers. Firstly, we investigate
the performance of a neural network containing only a single hidden layer, decomposed into a
pair of a linear projection layer and a nonlinear model layer. Here we evaluate neural networks
with a different number of hidden units ($K$) and a varying size of the projection layer ($M$).
From the experimental results shown in Table 2, we can see that the HOPE-trained NN can
achieve much better performance than the baseline NN, especially when smaller values are used
for $M$. This supports that the projection layer may eliminate residual noises in data to avoid
over-fitting when $M$ is properly set. However, after we relax the orthogonal constraint in the
HOPE model, as shown in Table 2, the performance of the models using only linear projection
layers gets much worse than those of the HOPE models as well as that of the baseline NN. These
results verify that orthogonal projection layers are critical in the HOPE models. Furthermore,
in Figure 4, we have plotted the learning curves of the total sum of all correlation coefficients
among all row vectors in the learned projection matrix $U$, i.e., $\sum_{i} \sum_{j > i} \frac{|u_i \cdot u_j|}{|u_i| \, |u_j|}$. We can see that
all projection vectors tend to get strongly correlated (especially when $M$ is large) in the linear
projection matrix as the learning proceeds. On the other hand, the orthogonal constraint can
effectively de-correlate all the projection vectors. Moreover, we show all correlation coefficients,
i.e., $\frac{|u_i \cdot u_j|}{|u_i| \, |u_j|}$, of the linear projection matrix and the HOPE orthogonal projection matrix as two
images in Figure 5, which clearly shows that the linear projection matrix has many strongly
correlated dimensions and the HOPE projection matrix is (as expected) orthogonal.

Table 2: Classification error rates (in %) on the MNIST test set using a shallow NN with one
hidden layer of different hidden units. Neither dropout nor data augmentation is used here
for quick turnaround. Two numbers in brackets, [M,K], indicate a HOPE model with the
orthogonal constraint in the projection, and two numbers in parentheses, (M,K), indicate the
same model structure without imposing the orthogonal constraint in the projection.

Net Architecture / K=       1k     2k     5k     10k    50k
Baseline: 784-K-10          1.49   1.35   1.28   1.30   1.32
HOPE1: 784-[100-K]-10       1.18   1.20   1.17   1.18   1.19
HOPE2: 784-[200-K]-10       1.21   1.20   1.17   1.19   1.18
HOPE3: 784-[400-K]-10       1.19   1.23   1.25   1.25   1.25
Linear1: 784-(100-K)-10     1.45   1.49   1.43   1.45   1.48
Linear2: 784-(200-K)-10     1.52   1.50   1.54   1.55   1.54
Linear3: 784-(400-K)-10     1.53   1.52   1.49   1.52   1.49
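The quantity tracked in the learning curves, the total correlation among the rows of a projection matrix, can be computed as in the sketch below. It assumes the sum runs over all distinct pairs of rows (i < j) and uses the absolute cosine between two rows as their correlation coefficient; the function name and test matrices are hypothetical.

```python
import math

def total_row_correlation(U):
    """Sum over i < j of |u_i . u_j| / (|u_i| |u_j|) for the rows u_i of U."""
    norms = [math.sqrt(sum(a * a for a in row)) for row in U]
    total = 0.0
    for i in range(len(U)):
        for j in range(i + 1, len(U)):
            dot = sum(a * b for a, b in zip(U[i], U[j]))
            total += abs(dot) / (norms[i] * norms[j])
    return total

orthogonal = total_row_correlation([[1.0, 0.0], [0.0, 2.0]])  # 0.0 for orthogonal rows
parallel = total_row_correlation([[1.0, 1.0], [2.0, 2.0]])    # about 1.0 for parallel rows
```

A value near zero indicates a nearly orthogonal projection, while larger values indicate strongly correlated projection vectors.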

As the MNIST training set is very small, we further use the dropout technique in [?]
to improve the model learning on the MNIST task. In this experiment, the visible dropout
probability is set to 20% and the hidden layer dropout probability is set to 30%. In Table 3,
we compare a 1-hidden-layer shallow NN with two HOPE models ($M$ = 200, 400). The results
show that the HOPE framework can significantly improve supervised learning of NNs. Under
the HOPE framework, we can train very simple shallow neural networks from scratch, which
can yield performance comparable to deep models. For example, on the MNIST task, as shown
in Table 3, we may achieve 0.85% in classification error rate using a shallow neural network
(with only one hidden layer of 2000 nodes) trained under the HOPE framework. Furthermore,
we consider building deeper models (two-hidden-layer NNs) under the HOPE framework. Using
the two different structures in Figure 3, we can further improve the classification error rate to
0.81%, as shown in Table 4. To the best of our knowledge, this is one of the best results
reported on MNIST without using CNNs and data augmentation.

Figure 4: The learning curves of the total sum of correlation coefficients of the linear projection
and orthogonal HOPE projection matrices, respectively. Left: M=200, K=1k,5k,10k,50k; right:
M=400, K=1k,5k,10k,50k. (Legend: HOPE-1000, HOPE-5000, HOPE-10000, HOPE-50000,
Linear-1000, Linear-5000, Linear-10000, Linear-50000.)

Figure 5: The correlation coefficients of the linear projection matrix (left) and the orthogonal
HOPE projection matrix (right) are shown as two images. Here M = 400 and K = 1000.

Table 3: Comparison of classification error rates (in %) on the MNIST test set between a
shallow NN and two HOPE-trained NNs with the same model structure. Dropout is used in
this experiment.

Net Architecture / K=       1k     2k     5k
NN baseline: 784-K-10       1.05   1.01   1.01
HOPE1: 784-[200-K]-10       0.99   0.85   0.89
HOPE2: 784-[400-K]-10       0.86   0.86   0.85

Table 4: Comparison of classification error rates (in %) on the MNIST test set between a
2-hidden-layer DNN and two HOPE-trained DNNs with similar model structures, with or without
using dropout.

model          Net Architecture                 without dropout   with dropout
DNN baseline   784-1200-1200-10                 1.25              0.92
HOPE + NN      784-[400-1200]-1200-10           0.99              0.82
HOPE*2         784-[400-1200]-[400-1200]-10     0.97              0.81


6.1.3 Semi-supervised Learning on MNIST 

In this experiment, we combine the unsupervised feature learning with supervised model learning
and examine the classification performance when only limited labelled data is available. Here
we also list the results using convolutional deep belief networks (CDBN) in [21] as a baseline
system. In our experiments, we use the raw pixel features and unsupervised learned (USL)
features in section 6.1.1. As an example, we choose the unsupervised learned features from the
HOPE-movMF model ($K = 800$) in Table 1. Next, we feed the features to a post-stage
classifier, which is trained supervised using only a portion of the training data, ranging from
1000 to 60000 (all). We consider many different types of classifiers here, including linear SVM,
regular DNNs and HOPE-trained DNNs. Note that all classifiers are trained separately from
the feature learning. All results are summarized in Table 5. It shows that we can achieve the
best performance when we combine the HOPE-trained USL features with HOPE-trained
post-stage classifiers. The gains are quite dramatic no matter how much labelled data is used. For
example, when only 5000 labelled samples are used, our method can achieve 0.90% in error rate,
which significantly outperforms all other methods including CDBN in [21]. Finally, when we use
all training data for the HOPE model, we can achieve 0.40% in error rate. To the best of our
knowledge, this is one of the best results reported on MNIST without using data augmentation.
Furthermore, our best system uses a quite simple model structure, consisting of a HOPE-trained
feature extraction layer of 800 nodes and a HOPE-trained NN of two hidden layers (1200 nodes
in each layer), which is much smaller and simpler than those top-performing systems on MNIST.


Footnote: For the HOPE-movMF model with $K = 800$, there are 115 empty clusters. Thus, the unsupervised learned
features are of 2740 in dimension.

Table 5: Classification error rates (in %) on the MNIST test set using raw pixel features or
unsupervised learned features, along with different post-stage classifiers trained separately by
limited labeled data. Here USL denotes the unsupervised learned features from HOPE-movMF
(K = 800). All used classifiers include: DNN1 (784-1200-1200-10); HOPE-DNN1
(784-[400-1200]-10); HOPE-DNN2 (784-[400-1200]-1200-10); DNN2 (2740-1200-1200-10); HOPE-DNN3
(2740-[400-1200]-10); HOPE-DNN4 (2740-[400-1200]-1200-10).

labeled training samples     1000   2000   5000   10000  20000  60000 (ALL)
CDBN [21]                    2.62   2.13   1.59   -      -      0.82
Raw feature + DNN1           8.32   4.71   3.2    2.15   1.48   0.92
Raw feature + HOPE-DNN1      8.22   4.53   2.92   2.04   1.34   0.86
Raw feature + HOPE-DNN2      7.21   4.02   2.60   1.83   1.30   0.82
USL feature + linear SVM     2.91   2.38   1.47   1.13   0.90   0.71
USL feature + DNN2           2.83   1.99   1.03   0.88   0.70   0.43
USL feature + HOPE-DNN3      2.50   1.78   0.95   0.87   0.67   0.42
USL feature + HOPE-DNN4      2.46   1.70   0.90   0.79   0.66   0.40


6.2 TIMIT: Speech Recognition 


In this experiment, we examine the supervised learning of shallow and deep neural networks
under the HOPE framework for a standard speech recognition task using the TIMIT data set.
The HOPE-based supervised learning method is compared with the regular back-propagation
training method. We use the minimum cross-entropy learning criterion here.

The TIMIT speech corpus consists of a training set of 462 speakers, a separate development
set of 50 speakers for cross-validation, and a core test set of 24 speakers. All results are reported
on the 24-speaker core test set. The speech waveform data is analyzed using a 25-ms Hamming
window with a 10-ms fixed frame rate. The speech feature vector is generated by
Fourier-transform-based filter-banks that include 40 coefficients distributed on the Mel scale and energy,
along with their first and second temporal derivatives. This leads to a 123-dimension feature
vector per speech frame. We further concatenate 15 consecutive frames within a long context
window of (7+1+7) to feed to the models, as 1845-dimension input vectors [?]. All speech data
are normalized by subtracting the mean of the training set and dividing by the standard
deviation of the training set on each dimension so that all input vectors have zero mean and
unit variance. We use 183 target class labels (i.e., 3 states for each of the 61 phones) for the
DNN training. After decoding, these 61 phone classes are mapped to a set of 39 classes for the
final scoring as in [22]. In our experiments, a bi-gram language model at phone level, estimated
from all transcripts in the training set, is used for speech recognition.
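The context-window construction described above (7 left frames, the current frame and 7 right frames, each of 123 dimensions, giving 15 x 123 = 1845 inputs) can be sketched as follows. How the first and last frames of an utterance are handled is not stated in the text; repeating the edge frames is an assumption made here, and the function name is hypothetical.

```python
def splice_frames(frames, left=7, right=7):
    """Concatenate each frame with its `left` and `right` neighbours.
    Utterance edges are handled by repeating the first/last frame (an assumption)."""
    n = len(frames)
    spliced = []
    for t in range(n):
        window = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp to the utterance
            window.extend(frames[idx])
        spliced.append(window)
    return spliced

# Toy utterance: 20 frames of 123-dimension features
frames = [[float(t)] * 123 for t in range(20)]
inputs = splice_frames(frames)  # 20 vectors of 15 * 123 = 1845 dimensions
```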

We first train ReLU-based shallow and deep neural networks as our baseline systems. The 
networks are trained using the back-propagation algorithm, with a mini-batch size of 100. The 
initial learning rate is set to 0.004 and is kept unchanged as long as the error rate on the 
development set is still decreasing. Afterwards, the learning rate is halved after every epoch, 
and the whole training procedure is stopped when the error reduction on the development set is 
less than 0.1% in two consecutive iterations. In our experiments, we also use momentum and 
weight decay, which are set to 0.9 and 0.0001, respectively. When we use the mini-batch SGD to 
train neural networks under the HOPE framework, the control parameter for the orthogonal 
constraints, i.e. β, is set to 0.01. 

Table 6: Supervised learning of neural networks on TIMIT with and without using the HOPE 
framework. The two numbers in a bracket, [M, K], indicate a HOPE model with M-dimension 
features and K mixture components. FACC: frame classification accuracy from neural networks. 
PER: phone error rate in speech recognition. 

model      Net Architecture              FACC (%)   PER (%)
NN         1845-10240-183                61.45      23.85
HOPE-NN    1845-[256-10240]-183          62.11      23.04
DNN        1845-3*2048-183               63.13      22.37
HOPE-DNN   1845-[512-2048]-2*2048-183    63.55      21.59
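
One possible reading of the learning-rate schedule described above is sketched below. The function name and the exact bookkeeping are assumptions: the text does not say whether the 0.1% threshold is absolute or relative (absolute is assumed here), nor whether the error is compared with the previous epoch or with the best epoch so far (the previous epoch is assumed).

```python
def lr_schedule(dev_errors, initial_lr=0.004, min_gain=0.1):
    """Return the learning rate used in each epoch until the stopping rule fires.

    dev_errors[i] is the development-set error rate (in %) measured after epoch i.
    """
    lr = initial_lr
    rates = []
    halving = False      # becomes True once the dev error stops decreasing
    small_gains = 0      # consecutive epochs with error reduction below min_gain
    prev = None
    for err in dev_errors:
        rates.append(lr)
        if prev is not None:
            gain = prev - err
            if gain <= 0:
                halving = True
            small_gains = small_gains + 1 if gain < min_gain else 0
            if small_gains == 2:
                break    # stop: two consecutive epochs with too little improvement
        if halving:
            lr /= 2      # halve after every epoch from now on
        prev = err
    return rates
```

Momentum (0.9) and weight decay (0.0001) are left to the optimizer and are not modelled in this sketch.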

In our experiments, we compare the standard NNs with the HOPE-trained NNs for two 
network architectures, one shallow network with one hidden layer of 10240 hidden nodes and 
one deep network with 3 hidden layers of 2048 nodes. The performance comparison between 
them is shown in Table 6. Our results are comparable with and another recent work [^, 
using deep neural networks on TIMIT. From the results, we can see that the HOPE-trained 
NNs consistently outperform the regular NNs, with an absolute reduction of about 0.8% in phone 
error rate. Moreover, the HOPE-trained neural networks are much smaller than 
their counterpart DNNs in number of model parameters if the HOPE layers are not merged. 
After merging, they have exactly the same model structure as their counterpart NNs. 
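
The parameter counts behind this comparison can be worked out with a short sketch. It assumes that an unmerged HOPE layer [M-K] is counted as two linear maps (input to M, then M to K) and that bias terms are ignored. The helper name is hypothetical.

```python
def dense_params(layer_sizes):
    """Number of weights in a chain of fully connected layers (biases ignored)."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# Architectures from Table 6; a HOPE layer [M-K] is expanded into M then K units.
nn = dense_params([1845, 10240, 183])
hope_nn = dense_params([1845, 256, 10240, 183])
dnn = dense_params([1845, 2048, 2048, 2048, 183])
hope_dnn = dense_params([1845, 512, 2048, 2048, 2048, 183])

print(nn, hope_nn)    # 20766720 4967680
print(dnn, hope_dnn)  # 12541952 10756608
```

Under these assumptions, the saving comes from replacing the first weight matrix (1845 x K) by the pair (1845 x M) and (M x K) with M much smaller than K. Multiplying the two matrices together recovers a single 1845 x K layer, which is the merged structure mentioned above.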

7 Final Remarks 

In this paper, we have proposed a novel model, called hybrid orthogonal projection and 
estimation (HOPE), for high-dimensional data. The HOPE model combines feature extraction and 
data modeling under a unified generative modeling framework so that both the feature extractor 
and the data model can be jointly learned in either a supervised or an unsupervised way. More 
interestingly, we have shown that the HOPE models are closely related to neural networks, in 
the sense that each hidden layer in NNs can be reformulated as a HOPE model. Therefore, the 
proposed HOPE-related learning algorithms can be easily applied to perform either supervised 
or unsupervised learning for neural networks. We have evaluated the proposed HOPE models in 
learning NNs on several standard tasks, including image recognition on MNIST and speech 
recognition on TIMIT. Experimental results strongly support that the HOPE models can provide 
a very effective unsupervised learning method for NNs. Meanwhile, the supervised learning of 
NNs can also be conducted under the HOPE framework, which normally yields better performance 
and more compact models. 

We are currently investigating the HOPE model to learn convolutional neural networks (CNNs) 
for more challenging image recognition tasks, such as CIFAR and ImageNet. At the same time, 
we are also examining the HOPE-based unsupervised learning for various natural language 
processing (NLP) tasks. These results will be reported in the future. 

Appendix 

A Learning HOPE when U is not orthonormal 

In some tasks, in addition to dimension reduction, we may want to use the projection matrix 
U to perform signal whitening to ensure the signal projection z has roughly the same variance 
along all dimensions. This has been shown to be quite important for many image recognition and 
speech recognition tasks. In this case, we may still want to impose the orthogonal constraints 
among all row vectors of U, i.e., $\mathbf{u}_i \cdot \mathbf{u}_j = 0$ ($i, j = 1, \cdots, M$, $i \neq j$), but these row vectors may 
not be of unit length, i.e., $|\mathbf{u}_i| > 1$ ($i = 1, \cdots, M$). Moreover, it is better not to whiten the 
residual noises in the remaining $D - M$ dimensions, so as not to amplify them unnecessarily. Therefore, 
we still enforce the unit-length constraints for the matrix V, i.e., $|\mathbf{u}_i| = 1$ ($i = M+1, \cdots, D$). 
Because U is not orthonormal anymore, when we compute the likelihood function of the original 
data in eq. © for HOPE, we have to include the Jacobian term as follows: 


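$$
\begin{aligned}
\mathcal{L}(\mathbf{U}, \Theta, \sigma \mid \mathbf{X}) &= \sum_{n=1}^{N} \ln \Pr(\mathbf{x}_n) = \sum_{n=1}^{N} \Big[ \ln|\mathbf{U}^{-1}| + \ln \Pr(\mathbf{z}_n) + \ln \Pr(\mathbf{n}_n) \Big] \\
&= \underbrace{N \ln|\mathbf{U}^{-1}|}_{\mathcal{J}(\mathbf{U})} + \underbrace{\sum_{n=1}^{N} \ln\Big( \sum_{k=1}^{K} \pi_k \cdot f_k(\mathbf{U}\mathbf{x}_n \mid \theta_k) \Big)}_{\mathcal{L}_1(\mathbf{U},\Theta)} + \underbrace{\sum_{n=1}^{N} \ln \mathcal{N}(\mathbf{n}_n \mid \mathbf{0}, \sigma^2 \mathbf{I})}_{\mathcal{L}_2(\mathbf{U},\sigma)}
\end{aligned}
\tag{44}
$$

Because $\mathbf{U}\mathbf{U}^\top \neq \mathbf{I}$ here, we have 

$$
\mathbf{x}_n = \mathbf{U}^\top (\mathbf{U}\mathbf{U}^\top)^{-1} \mathbf{z}_n + \mathbf{V}^\top \mathbf{n}_n
\tag{45}
$$

and then we have 

$$
\begin{aligned}
\mathbf{x}_n^\top \mathbf{x}_n &= \big( \mathbf{n}_n^\top \mathbf{V} + \mathbf{z}_n^\top (\mathbf{U}\mathbf{U}^\top)^{-1} \mathbf{U} \big) \big( \mathbf{V}^\top \mathbf{n}_n + \mathbf{U}^\top (\mathbf{U}\mathbf{U}^\top)^{-1} \mathbf{z}_n \big) \\
&= \mathbf{n}_n^\top \mathbf{n}_n + \mathbf{z}_n^\top (\mathbf{U}\mathbf{U}^\top)^{-1} \mathbf{z}_n = \mathbf{n}_n^\top \mathbf{n}_n + \mathbf{x}_n^\top \mathbf{U}^\top (\mathbf{U}\mathbf{U}^\top)^{-1} \mathbf{U} \mathbf{x}_n
\end{aligned}
\tag{46}
$$

Therefore, we may derive the residual noise energy as: 

$$
\mathbf{n}_n^\top \mathbf{n}_n = \mathbf{x}_n^\top \mathbf{x}_n - \mathbf{x}_n^\top \mathbf{U}^\top (\mathbf{U}\mathbf{U}^\top)^{-1} \mathbf{U} \mathbf{x}_n = \mathbf{x}_n^\top \Big[ \mathbf{I} - \mathbf{U}^\top (\mathbf{U}\mathbf{U}^\top)^{-1} \mathbf{U} \Big] \mathbf{x}_n
\tag{47}
$$

In this case, $\mathcal{L}_2(\mathbf{U}, \sigma)$ can be expressed as: 

$$
\mathcal{L}_2(\mathbf{U}, \sigma) = -\frac{N(D-M)}{2} \ln(\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^{N} \Big[ \mathbf{x}_n^\top \mathbf{x}_n - \mathbf{x}_n^\top \mathbf{U}^\top (\mathbf{U}\mathbf{U}^\top)^{-1} \mathbf{U} \mathbf{x}_n \Big]
\tag{48}
$$

Therefore, its gradient with respect to U can be derived as follows: 

$$
\begin{aligned}
\frac{\partial \mathcal{L}_2(\mathbf{U}, \sigma)}{\partial \mathbf{U}} &= \frac{1}{\sigma^2} \sum_{n=1}^{N} \Big[ (\mathbf{U}\mathbf{U}^\top)^{-1} \mathbf{U} \mathbf{x}_n \mathbf{x}_n^\top - (\mathbf{U}\mathbf{U}^\top)^{-1} \mathbf{U} \mathbf{x}_n \mathbf{x}_n^\top \mathbf{U}^\top (\mathbf{U}\mathbf{U}^\top)^{-1} \mathbf{U} \Big] \\
&= \frac{1}{\sigma^2} \sum_{n=1}^{N} (\mathbf{U}\mathbf{U}^\top)^{-1} \mathbf{U} \mathbf{x}_n \mathbf{x}_n^\top \Big[ \mathbf{I} - \mathbf{U}^\top (\mathbf{U}\mathbf{U}^\top)^{-1} \mathbf{U} \Big]
\end{aligned}
\tag{49}
$$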
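
As a small numerical illustration of the residual noise energy in eq. (47), the sketch below uses the fact that when the rows of U are mutually orthogonal, $(\mathbf{U}\mathbf{U}^\top)^{-1}$ is diagonal with entries $1/|\mathbf{u}_i|^2$, so the projected energy reduces to a sum of $(\mathbf{u}_i \cdot \mathbf{x})^2 / |\mathbf{u}_i|^2$. The function name and the toy matrices are illustrative assumptions, not taken from the paper.

```python
def residual_energy(x, rows):
    """x^T x - x^T U^T (U U^T)^{-1} U x, assuming the rows of U are mutually
    orthogonal (not necessarily unit length), so (U U^T)^{-1} is diagonal."""
    total = sum(v * v for v in x)
    captured = 0.0
    for u in rows:
        dot = sum(a * b for a, b in zip(u, x))
        captured += dot * dot / sum(a * a for a in u)
    return total - captured

# Toy example in D = 3 with M = 2: orthogonal rows of different lengths.
U = [[2.0, 2.0, 0.0], [3.0, -3.0, 0.0]]
V = [[0.0, 0.0, 1.0]]          # unit-length row orthogonal to the rows of U
x = [1.0, 2.0, 3.0]

noise = [sum(a * b for a, b in zip(v, x)) for v in V]   # n = V x
print(residual_energy(x, U), sum(c * c for c in noise))  # 9.0 9.0
```

The two printed values agree, so in this toy case the residual energy can be evaluated from U alone, without forming V.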


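Next, we consider the Jacobian term, $\mathcal{J}(\mathbf{U})$, which can be computed as follows: 

$$
\mathcal{J}(\mathbf{U}) = N \cdot \ln|\mathbf{U}^{-1}| = -N \sum_{i=1}^{M} \ln|\mathbf{u}_i|
\tag{50}
$$

Because of $\frac{\partial \ln|\mathbf{u}_i|}{\partial \mathbf{u}_i} = \frac{\mathbf{u}_i}{|\mathbf{u}_i|^2}$ ($i = 1, \cdots, M$), it is easy to show that its derivative with respect 
to U can be derived as: 

$$
\frac{\partial \mathcal{J}(\mathbf{U})}{\partial \mathbf{U}} = -N \cdot (\mathbf{U}\mathbf{U}^\top)^{-1} \mathbf{U}
\tag{51}
$$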


Similarly, the HOPE parameters, i.e., U, $\Theta$ and $\sigma$, can be estimated by maximizing the 
above likelihood function as follows: 

$$
\{\mathbf{U}^*, \Theta^*, \sigma^*\} = \arg\max_{\mathbf{U}, \Theta, \sigma} \mathcal{L}(\mathbf{U}, \Theta, \sigma \mid \mathbf{X})
\tag{52}
$$

subject to an orthogonal constraint: 

$$
\mathbf{U}\mathbf{U}^\top = \Phi,
\tag{53}
$$

where $\Phi$ is a diagonal matrix. The above constraint can also be implemented as a penalty 
term similar to eq. (15). However, the norm of each row vector is relaxed as follows: 

$$
|\mathbf{u}_i| > 1 \quad (i = 1, \cdots, M)
\tag{54}
$$

The log-likelihood function related to the signal model, $\mathcal{L}_1(\mathbf{U}, \Theta)$, and signal variance, 
are calculated in the same way as before. 

B Derivatives of movMFs 


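The partial derivatives of the objective function in eq. (29) w.r.t. all $\boldsymbol{\mu}_k$ can be computed as 
follows (with $\tilde{\mathbf{z}}_n = \mathbf{z}_n / |\mathbf{z}_n|$ denoting the unit-length projection): 

$$
\frac{\partial \mathcal{L}_1(\mathbf{U}, \Theta)}{\partial \boldsymbol{\mu}_k} = \sum_{n=1}^{N} \frac{\pi_k}{\sum_{j=1}^{K} \pi_j \cdot C_M(|\boldsymbol{\mu}_j|) \cdot e^{\tilde{\mathbf{z}}_n \cdot \boldsymbol{\mu}_j}} \cdot \frac{\partial}{\partial \boldsymbol{\mu}_k} \Big[ C_M(|\boldsymbol{\mu}_k|) \cdot e^{\tilde{\mathbf{z}}_n \cdot \boldsymbol{\mu}_k} \Big]
\tag{55}
$$

where we have 

$$
\frac{\partial}{\partial \boldsymbol{\mu}_k} \Big[ C_M(|\boldsymbol{\mu}_k|) \cdot e^{\tilde{\mathbf{z}}_n \cdot \boldsymbol{\mu}_k} \Big] = C_M(|\boldsymbol{\mu}_k|) \cdot e^{\tilde{\mathbf{z}}_n \cdot \boldsymbol{\mu}_k} \cdot \tilde{\mathbf{z}}_n + e^{\tilde{\mathbf{z}}_n \cdot \boldsymbol{\mu}_k} \cdot C'_M(|\boldsymbol{\mu}_k|) \cdot \frac{\boldsymbol{\mu}_k}{|\boldsymbol{\mu}_k|}
\tag{56}
$$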
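As to $C'_M(\kappa)$, for brevity, let us denote $s = \frac{M}{2} - 1$ and $\xi = (2\pi)^{s+1}$, so that $C_M(\kappa) = \kappa^s / (\xi \cdot I_s(\kappa))$. Then 

$$
C'_M(\kappa) = \frac{s \cdot \kappa^{s-1}}{\xi \cdot I_s(\kappa)} - \frac{\kappa^s \cdot I'_s(\kappa)}{\xi \cdot I_s^2(\kappa)} = C_M(\kappa) \cdot \left( \frac{s}{\kappa} - \frac{I'_s(\kappa)}{I_s(\kappa)} \right)
\tag{57}
$$

where $I_\nu(\cdot)$ denotes the modified Bessel function of the first kind of order $\nu$. Because we have 

$$
\kappa \, I_{s+1}(\kappa) = \kappa \, I'_s(\kappa) - s \, I_s(\kappa)
$$

Thus, we may derive 

$$
C'_M(\kappa) = -C_M(\kappa) \cdot \frac{I_{s+1}(\kappa)}{I_s(\kappa)}
\tag{58}
$$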

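Substituting eq. (56) and eq. (58) into eq. (55), we obtain the partial derivative of the 
objective function in eq. (29) w.r.t. $\boldsymbol{\mu}_k$ as follows: 

$$
\begin{aligned}
\frac{\partial \mathcal{L}_1(\mathbf{U}, \Theta)}{\partial \boldsymbol{\mu}_k} &= \sum_{n=1}^{N} \frac{\pi_k \cdot C_M(|\boldsymbol{\mu}_k|) \cdot e^{\tilde{\mathbf{z}}_n \cdot \boldsymbol{\mu}_k}}{\sum_{j=1}^{K} \pi_j \cdot C_M(|\boldsymbol{\mu}_j|) \cdot e^{\tilde{\mathbf{z}}_n \cdot \boldsymbol{\mu}_j}} \cdot \left( \tilde{\mathbf{z}}_n - \frac{\boldsymbol{\mu}_k}{|\boldsymbol{\mu}_k|} \cdot \frac{I_{M/2}(|\boldsymbol{\mu}_k|)}{I_{M/2-1}(|\boldsymbol{\mu}_k|)} \right) \\
&= \sum_{n=1}^{N} \gamma(z_{nk}) \cdot \left( \tilde{\mathbf{z}}_n - \frac{\boldsymbol{\mu}_k}{|\boldsymbol{\mu}_k|} \cdot \frac{I_{M/2}(|\boldsymbol{\mu}_k|)}{I_{M/2-1}(|\boldsymbol{\mu}_k|)} \right)
\end{aligned}
\tag{59}
$$

where $\gamma(z_{nk}) = \frac{\pi_k \cdot C_M(|\boldsymbol{\mu}_k|) \cdot e^{\tilde{\mathbf{z}}_n \cdot \boldsymbol{\mu}_k}}{\sum_{j=1}^{K} \pi_j \cdot C_M(|\boldsymbol{\mu}_j|) \cdot e^{\tilde{\mathbf{z}}_n \cdot \boldsymbol{\mu}_j}}$ is the occupancy statistics of the k-th component for $\mathbf{z}_n$, and $\tilde{\mathbf{z}}_n = \mathbf{z}_n / |\mathbf{z}_n|$. 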

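Next, let us consider the partial derivative of the objective function in eq. (29) w.r.t. U. Based 
on the chain rule, and since $\mathbf{z}_n = \mathbf{U}\mathbf{x}_n$ and $\tilde{\mathbf{z}}_n = \mathbf{z}_n / |\mathbf{z}_n|$, we have 

$$
\frac{\partial \mathcal{L}_1(\mathbf{U}, \Theta)}{\partial \mathbf{U}} = \sum_{n=1}^{N} \frac{\partial \mathcal{L}_1(\mathbf{U}, \Theta)}{\partial \mathbf{z}_n} \, \mathbf{x}_n^\top = \sum_{n=1}^{N} \left( \frac{\partial \tilde{\mathbf{z}}_n}{\partial \mathbf{z}_n} \right)^{\!\top} \frac{\partial \mathcal{L}_1(\mathbf{U}, \Theta)}{\partial \tilde{\mathbf{z}}_n} \, \mathbf{x}_n^\top
\tag{60}
$$

Furthermore, we may derive 

$$
\frac{\partial \mathcal{L}_1(\mathbf{U}, \Theta)}{\partial \tilde{\mathbf{z}}_n} = \sum_{k=1}^{K} \frac{\pi_k \cdot C_M(|\boldsymbol{\mu}_k|) \cdot e^{\tilde{\mathbf{z}}_n \cdot \boldsymbol{\mu}_k}}{\sum_{j=1}^{K} \pi_j \cdot C_M(|\boldsymbol{\mu}_j|) \cdot e^{\tilde{\mathbf{z}}_n \cdot \boldsymbol{\mu}_j}} \cdot \boldsymbol{\mu}_k = \sum_{k=1}^{K} \gamma(z_{nk}) \cdot \boldsymbol{\mu}_k
\tag{61}
$$

$$
\frac{\partial \tilde{\mathbf{z}}_n}{\partial \mathbf{z}_n} = \frac{\partial (\mathbf{z}_n / |\mathbf{z}_n|)}{\partial \mathbf{z}_n} = \frac{1}{|\mathbf{z}_n|} \left( \mathbf{I} - \tilde{\mathbf{z}}_n \left( \frac{\partial |\mathbf{z}_n|}{\partial \mathbf{z}_n} \right)^{\!\top} \right) = \frac{1}{|\mathbf{z}_n|} \left( \mathbf{I} - \tilde{\mathbf{z}}_n \tilde{\mathbf{z}}_n^\top \right)
\tag{62}
$$

Substituting eq. (61) and eq. (62) into eq. (60), we can obtain 

$$
\frac{\partial \mathcal{L}_1(\mathbf{U}, \Theta)}{\partial \mathbf{U}} = \sum_{n=1}^{N} \sum_{k=1}^{K} \frac{\gamma(z_{nk})}{|\mathbf{z}_n|} \left( \mathbf{I} - \tilde{\mathbf{z}}_n \tilde{\mathbf{z}}_n^\top \right) \boldsymbol{\mu}_k \, \mathbf{x}_n^\top
\tag{63}
$$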
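
The Jacobian of the unit-length normalization used in eq. (62), namely $(\mathbf{I} - \tilde{\mathbf{z}}\tilde{\mathbf{z}}^\top)/|\mathbf{z}|$, can be sanity-checked against central finite differences. The sketch below is only an illustration with hypothetical helper names and does not reproduce any code from the paper.

```python
import math

def normalize(z):
    """Return z / |z|."""
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z]

def normalization_jacobian(z):
    """Analytic Jacobian of z / |z| w.r.t. z: (I - zt zt^T) / |z|."""
    norm = math.sqrt(sum(v * v for v in z))
    zt = [v / norm for v in z]
    m = len(z)
    return [[((1.0 if i == j else 0.0) - zt[i] * zt[j]) / norm for j in range(m)]
            for i in range(m)]

def numeric_jacobian(z, eps=1e-6):
    """Central finite-difference estimate of the same Jacobian."""
    m = len(z)
    jac = [[0.0] * m for _ in range(m)]
    for j in range(m):
        plus, minus = list(z), list(z)
        plus[j] += eps
        minus[j] -= eps
        fp, fm = normalize(plus), normalize(minus)
        for i in range(m):
            jac[i][j] = (fp[i] - fm[i]) / (2 * eps)
    return jac

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```

For a test vector such as `[1.0, -2.0, 0.5]`, `max_abs_diff(normalization_jacobian(z), numeric_jacobian(z))` should be close to zero, up to finite-difference error.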

C Numerical Methods for $I_\nu(\cdot)$ 

In the learning algorithm for movMFs, we may need to compute the Bessel functions, $I_\nu(\cdot)$, in 
several places. First of all, we need to compute the normalization term $C_M(\kappa)$ when calculating 
the likelihood function of a vMF distribution as in eq.®. Secondly, we need to calculate the 
ratios of the modified Bessel functions, $A_d(\kappa) = \frac{I_{d/2}(\kappa)}{I_{d/2-1}(\kappa)}$, in eq. (59). As we know, the modified 
Bessel functions of the first kind take the following form: 

$$
I_d(\kappa) = \sum_{k \geq 0} \frac{1}{\Gamma(d + k + 1) \, k!} \left( \frac{\kappa}{2} \right)^{2k+d}
\tag{64}
$$


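From eq. (64), we can see that when $\kappa \gg d$, $I_d(\kappa)$ overflows quite rapidly. Meanwhile, when 
$\kappa = o(d)$ and $d \to \infty$, $I_d(\kappa)$ underflows quite rapidly. In this work, we use the approximation 
strategy in eq. (9.7.7) on page 378 of as follows: 

$$
I_d(\kappa) \sim \frac{1}{\sqrt{2\pi d}} \cdot \frac{e^{d\eta}}{\left( 1 + (\kappa/d)^2 \right)^{1/4}} \cdot \left[ 1 + \sum_{k=1}^{\infty} \frac{u_k(t)}{d^k} \right]
\tag{65}
$$

where we have 

$$
t = \frac{1}{\sqrt{1 + (\kappa/d)^2}}
\tag{66}
$$

$$
\eta = \sqrt{1 + (\kappa/d)^2} + \ln \frac{\kappa/d}{1 + \sqrt{1 + (\kappa/d)^2}}
\tag{67}
$$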

with the functions $u_k(t)$ taking the following forms: 

$$
\begin{aligned}
u_0(t) &= 1 \\
u_1(t) &= (3t - 5t^3)/24 \\
u_2(t) &= (81t^2 - 462t^4 + 385t^6)/1152
\end{aligned}
$$

Refer to page 366 of for other higher orders $u_k(t)$. 

Usually, the sum in the term $\left[ 1 + \sum_{k=1}^{\infty} \frac{u_k(t)}{d^k} \right]$ in eq. (65) is very small and it is safe to eliminate 
it from evaluation in most cases. Then, after substituting eq. (66) and eq. (67) into eq. (65), the 
logarithm of the approximated modified Bessel function is finally computed as follows: 

$$
\ln I_d(\kappa) = -\ln\sqrt{2\pi d} + d \cdot \left( \sqrt{1 + (\kappa/d)^2} + \ln\frac{\kappa}{d} - \ln\Big( 1 + \sqrt{1 + (\kappa/d)^2} \Big) \right) - \frac{1}{4} \ln\Big( 1 + (\kappa/d)^2 \Big)
\tag{68}
$$

In this work, the approximation in eq. (68) is used to compute all Bessel functions in the 
learning algorithms for movMFs. 
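
A minimal sketch of the log-domain approximation in eq. (68) is given below, together with a log-domain evaluation of the power series in eq. (64) that can serve as a reference for moderate arguments. The function names, the number of series terms, and the test values are assumptions made for illustration only.

```python
import math

def log_bessel_i_approx(d, kappa):
    """ln I_d(kappa) from the leading term of the uniform expansion, eq. (68)."""
    z = kappa / d
    root = math.sqrt(1.0 + z * z)
    eta = root + math.log(z) - math.log1p(root)          # eq. (67)
    return (-0.5 * math.log(2.0 * math.pi * d)
            + d * eta
            - 0.25 * math.log(1.0 + z * z))

def log_bessel_i_series(d, kappa, terms=500):
    """ln I_d(kappa) from the power series in eq. (64), summed in the log domain.
    Only reliable when `terms` comfortably exceeds kappa."""
    logs = [(2 * k + d) * math.log(kappa / 2.0)
            - math.lgamma(d + k + 1) - math.lgamma(k + 1)
            for k in range(terms)]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))

# Compare the two for a moderate argument, and probe the extreme regimes
# where a direct evaluation of I_d(kappa) would overflow or underflow.
print(log_bessel_i_approx(20, 10.0), log_bessel_i_series(20, 10.0))
print(log_bessel_i_approx(10, 5000.0), log_bessel_i_approx(1000, 1.0))
```

Working with $\ln I_d(\kappa)$ throughout keeps the values finite in both regimes. A ratio such as $A_d(\kappa)$ can then be formed as the exponential of a difference of two logarithms.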